import Bio
import json
import locale
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import platform
import shutil
import sys
import webbrowser
# Necessary to set font family for Latex in matplotlib:
from matplotlib import rc
locale.setlocale(locale.LC_ALL, '')
o_s = platform.system()
paths = {}
# to make this notebook's output stable across runs
np.random.seed(42)
if o_s == 'Darwin':
    paths['data_dir'] = '/Users/drew/data/Bio'
    paths['sra_tools_dir'] = '/Users/drew/Documents/Data/Bio/sratoolkit/bin'
    paths['fastqc'] = '/Applications/FastQC.app/Contents/MacOS/fastqc'
    paths['bbmap_dir'] = '/Users/drew/Documents/Data/Python/bbmap'
elif o_s == 'Windows':
    paths['data_dir'] = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap'
    paths['sra_tools_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin'
    paths['fastqc_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\FastQC'
    paths['multiqc_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC'
    paths['jdk_dir'] = r'C:\Program Files\Java\jdk1.8.0_101'
    paths['bzip_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\Glimmer\Glimmer-master\src\main\java'
    paths['bbmap_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\BBMap'
    paths['phi_x'] = r'C:/Users/DMacKellar/Documents/Python/BioPython/BBMap/resources/phix174_ill.ref.fa'
paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')
# rc('font',**{'family':'serif','serif':['DejaVu Sans']})
rc('text', usetex=False)
for root, dirs, files in os.walk(paths['data_dir']):
    for file in files:
        file_under = str(file).replace('.', '_')
        path = os.path.join(root, file)
        paths[file_under] = path
sra_table = pd.read_csv(paths['Bmap_SraRunTable_txt'], sep='\t')
paths['credentials_json'] = os.path.join(paths['data_dir'], 'credentials.json')
with open(paths['credentials_json'], 'r') as f:
    credentials = json.load(f)
Some notebooks and other Python code found online offer inspiration/tips/demonstrations in bioinformatics:
A Zika project use case RNA-seq notebook
Another collection of such sites/notebooks
The NCBI instructions on FTP access to the SRA database
Galaxy's RNA-seq guide looks like a pretty thorough overview of the process
The point of this notebook is to brush up on some basic Bioinformatics data-wrangling tasks. I've sequenced a novel bacterial genome before, but a more common application relevant to future job searches is aligning RNAseq data, and thereby deriving tissue-specific gene expression patterns.
I got the initial data set from here. I downloaded the data files to:
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq
on PC and:
/Users/drew/Documents/Data/Python/Galaxy_rnaseq
on Mac.
I'll need fastqc, bowtie, cufflinks, tophat, and stuff. The installation path for Ubuntu on Windows is:
C:\Users\DMacKellar\AppData\Local\Lxss\rootfs
Just the dependencies for fastqc are >300MB, so I'll want to remember this location and possibly uninstall the tools when I'm done with them.
Docs for fastqc are here.
The raw reads for the Illumina BodyMap 2.0 project are here. There are 48 files, each with somewhere around 8E7 spots (potentially representing that many transcripts), for about 6-8E9 Bp per file. A per-run listing is here.
RNA-seq alignment and quantification involves specific assumptions about methodology that are distinct from those of processing DNA reads for genome sequencing.
General info about the Illumina Body Map 2 is available from several sources. One note on methodology from this talk:
16 human tissues are represented in the project, and for each tissue they ran one whole lane of a HiSeq2000 sequencer, mixing the paired-end and 1x75 sequences.
As a side project, they pooled total RNA from each of the 16 tissues, then subjected the pooled RNAs to each of three different treatments:
Apparently, the point of this latter approach was to validate a new protocol they had developed that was meant to enrich non-rRNA in a sample. From each of these pooled runs they collected 1x100bp reads per sequence.
Assessing sequence and base quality and trimming accordingly can be done with a combination of FastQC and Trimmomatic, or via BioPython.
We can get a preliminary look at the fastq format by simply opening the Bash shell and typing
less [filename.fastq]
In the case of the example files downloaded from the site listed above, the first three reads in 'Galaxy2-[adrenal_1.fastq].fastqsanger' are:
@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF
@ERR030881.311 HWI-BRUNOP16X_0001:2:1:18330:1130#0/1
TCCATACATAGGCCTCGGGGTGGGGGAGTCAGAAGCCCCCAGACCCTGTG
+
GFFFGFFBFCHHHHHHHHHHIHEEE@@@=GHGHHHHHHHHHHHHHHHHHH
@ERR030881.1487 HWI-BRUNOP16X_0001:2:1:4144:1420#0/1
GTATAACGCTAGACACAGCGGAGCTCGGGATTGGCTAAACTCCCATAGTA
+
55*'+&&5'55('''888:8FFFFFFFFFF4/1;/4./++FFFFF=5:E#
As can be seen, the reads are all cheek-to-jowl, with no blank lines separating them. The eye is drawn to the whitespace of the lines that contain only '+', but those lines don't come between different reads; rather, each one separates a read's base calls from its quality scores. In fact, the line order goes:
Unique Read Identifier; details on interpretation are available from Wikipedia. The first few characters common to all reads in the file usually either designate the specific machine on which they were sequenced or, as here, give the string under which the data were registered on the NCBI Sequence Read Archive. Specifically, in this case, the page is here.
Base Calls; I believe the only valid values are 'A', 'C', 'G', 'T', and 'N'.
A spacer line containing '+'. Despite appearances, it doesn't denote any kind of assumption about which strand of the chromosome the read originates from; they're all supposed to be '+'.
Quality scores; base quality scores can be interpreted with the info here. The encoding differs between platforms, however, so attention must be paid to what machine/technology generated the reads.
After those four lines, the next read comes immediately, with no separating line.
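The four-line structure above can be read with a minimal hand-rolled parser. This is just a sketch that assumes a strict four-lines-per-record layout (the function name is my own; real parsers like Bio.SeqIO handle edge cases such as wrapped sequence lines):

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) tuples from a
    strictly four-line-per-record FASTQ stream (a simplifying
    assumption; Bio.SeqIO is more forgiving)."""
    while True:
        header = handle.readline().rstrip('\n')
        if not header:
            return  # end of file
        seq = handle.readline().rstrip('\n')
        handle.readline()  # the '+' spacer line
        qual = handle.readline().rstrip('\n')
        yield header[1:], seq, qual  # drop the leading '@'

example = io.StringIO(
    '@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1\n'
    'ATCTT\n'
    '+\n'
    '5.544\n')
for read_id, seq, qual in read_fastq(example):
    print(read_id.split()[0], seq, qual)
```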
I intend to try BioPython's SeqIO module later, but I am more familiar with FastQC, and will use it first to analyze these example data. FastQC can be run in a GUI mode, in an independent window. Opening a '.fastq' file within that automatically reads all sequences within and offers several tabs of summary data, and prints an icon beside the title of each tab that summarizes whether or not FastQC considers the output problematic for future integration into a mapping/assembly pipeline.
sra_table = pd.read_csv(paths['Bmap_SraRunTable_txt'], sep='\t')
sra_table.head()
sra_table.columns
# sra_table['organism_part'].unique()
sra_table[sra_table['organism_part'] == 'brain']
As the website listed above for this exercise notes, FastQC was mostly built around genome assembly applications, and so the quality scores and assumptions about suitability for each sequence or base may not always be appropriate for other purposes, such as RNA-seq. So the icons summarizing each tab aren't to be considered definitive.
from matplotlib import pyplot as plt
from matplotlib import image
import os
plt.close('all')
fastq3_dir = (r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
              r'\Galaxy3-[adrenal_2.fastq].fastqsanger_fastqc'
              r'\Galaxy3-[adrenal_2.fastq].fastqsanger_fastqc\Images')
# fig = plt.figure(figsize=(20, 40))
my_dpi = 96
# plt.figure(figsize=(60, 80))#600/my_dpi, 800/my_dpi), dpi=my_dpi)
fig_scale = 500
fig, axs = plt.subplots(2, 1, figsize=(6*fig_scale/my_dpi, 8*fig_scale/my_dpi), dpi=my_dpi)
# ax1 = fig.add_subplot(211)
# ax2 = fig.add_subplot(212)
axs[0].imshow(image.imread(os.path.join(fastq3_dir, 'per_base_quality.png')))
axs[1].imshow(image.imread(os.path.join(fastq3_dir, 'per_sequence_quality.png')))
fig.suptitle('Galaxy3-[adrenal_2]', fontsize=0.08 * fig_scale)
# plt.tight_layout()
# axs[0].set_title('Galaxy3-[adrenal_2]')
plt.show()
I would like a general-purpose function to print specific graphs from each FastQC output. Once FastQC has run, it outputs a zip file. When extracted, this has the general structure
[dir the fastq file was in] / [dir with same name as the fastq file]_fastqc / 'Images' / 'per_base_quality.png'
or
[dir the fastq file was in] / [dir with same name as the fastq file]_fastqc / 'Images' / 'per_sequence_quality.png'
import os
import re
import matplotlib.pyplot as plt
from collections import defaultdict
def plot_fastqc(outer_dir):
    # defaultdict(dict) so that whichever image file os.walk yields first
    # can create the sub-dict for its fastq (file order isn't guaranteed)
    fastqc_plots = defaultdict(dict)
    pattern1 = re.compile(r'(.*)\.fastq.*')
    # get all files and subdirectories in the dir containing the fastq files
    for root, dirs, files in os.walk(outer_dir):
        for file in files:
            # look for the 'per_base_quality' files
            if file == 'per_base_quality.png':
                # split the path to that file into drives/dirs
                subdirs = root.split(os.sep)
                # get just the part of the penultimate dir that's unique,
                # then store the image path under a 'per_base' key
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name]['per_base'] = os.path.join(root, file)
            elif file == 'per_sequence_quality.png':
                subdirs = root.split(os.sep)
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name]['per_sequence'] = os.path.join(root, file)
            elif file == 'per_base_sequence_content.png':
                subdirs = root.split(os.sep)
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name]['per_base_sequence_content'] = os.path.join(root, file)
            # elif file == 'kmer_profiles.png':
            #     subdirs = root.split(os.sep)
            #     fastq_name = re.search(pattern1, subdirs[-2]).group(1)
            #     fastqc_plots[fastq_name]['kmer_profiles'] = os.path.join(root, file)
            else:
                continue
    # Now, plot the images
    plt.close('all')
    my_dpi = 96
    fig_scale = 500
    fig, axs = plt.subplots(3, len(fastqc_plots),
                            figsize=(6*fig_scale/my_dpi, 4*fig_scale/my_dpi),
                            dpi=my_dpi)
    tick_params1 = {'axis': 'x', 'which': 'both', 'bottom': 'off', 'top': 'off', 'labelbottom': 'off'}
    for i, fastq_name in enumerate(fastqc_plots):
        axs[0, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_base']))
        axs[0, i].axis('off')
        axs[0, i].set_title(fastq_name, fontsize=fig_scale/17)
        axs[1, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_sequence']))
        axs[1, i].axis('off')
        axs[2, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_base_sequence_content']))
        axs[2, i].axis('off')
        # axs[3, i].imshow(plt.imread(fastqc_plots[fastq_name]['kmer_profiles']))
    plt.subplots_adjust(wspace=0.05, hspace=0)
    plt.show()
    return fastqc_plots
outer_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
fastqc_plots = plot_fastqc(outer_dir)
Note: It's really strange; I had originally designed the plot_fastqc function to yield 4 graphs, but whether I try 'kmer_profiles' or 'duplication_levels' as the final field, it always returns an error: either a KeyError (on the same dict key the previous 3 fields use) or "'NoneType' object has no attribute 'update'". I copied and pasted the code that generates that key for each of the 'elif' statements, so the key generation itself can't be the problem. My best guess now is that the later elif branches assumed the 'per_base_quality.png' branch had already created the sub-dict for a given fastq file, and os.walk may yield 'duplication_levels.png' first (directory listings are often alphabetical), so the lookup ran before the key existed; using a defaultdict(dict) avoids that. In any case, for now, rather than re-working it, I'm just going to omit the fourth graph and move on.
Plotting them this way makes the graphs a little small, but it is at least possible to see the overall trends. For instance, the 'Galaxy4' run was particularly poor, especially for the first 10 bases.
Another option is to take the FastQC output 'fastqc_data.txt' file for each FastQ file, parse the info inside, and re-plot in matplotlib. Actually, that's kind of ugly, since that file just gives summary stats (mean, median, quartiles) rather than raw data, and matplotlib box plots are normally calculated from raw data. You could fudge the summary stats into a form plottable by matplotlib, but it seems like extra work.
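For what it's worth, matplotlib can draw box plots directly from precomputed statistics via Axes.bxp, so the fudging may be less work than it sounds. A sketch with made-up per-position values (only the dict keys are matplotlib's; the numbers are invented):

```python
import matplotlib.pyplot as plt

# Precomputed per-position summary stats, like those in fastqc_data.txt
# (values here are made up for illustration)
stats = [
    {'med': 34, 'q1': 30, 'q3': 37, 'whislo': 20, 'whishi': 40, 'label': '1'},
    {'med': 36, 'q1': 33, 'q3': 38, 'whislo': 25, 'whishi': 40, 'label': '2'},
]
fig, ax = plt.subplots()
# Axes.bxp draws boxes from the summary dicts; no raw per-read data needed
result = ax.bxp(stats, showfliers=False)
ax.set_ylabel('Phred quality')
plt.show()
```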
As an alternative, at this point, I feel like switching over to BioPython, rather than carrying through Trimmomatic. We'll see how that goes.
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'
def parse_fastq(start_dir):
    biopy_fastqs = {}
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] == 'fastqsanger':
            name = file.split(sep='.')[0] + ']'
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': file_path,
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                biopy_fastqs[name]['seqs'].append(
                    {'id': record.id,
                     'sequence': np.array(list(str(record.seq))),
                     'quality': np.array(record.letter_annotations['phred_quality'])})
            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
    return biopy_fastqs
# biopy_fastqs = parse_fastq(start_dir_Mac)
biopy_fastqs = parse_fastq(start_dir_PC)
biopy_fastqs['Galaxy2-[adrenal_1]']['df'].head()
biopy_fastqs['Galaxy2-[adrenal_1]']['seqs'][:2]
Interestingly, on the Mac, that generates an error (warning?) about the IOPub data rate being exceeded. The warning suggests setting a flag when launching the notebook server, which sounds like a pain. I found this suggestion about changing a config file instead. I'll try that now.
$ jupyter notebook --generate-config
Writing default config to: /Users/drew/.jupyter/jupyter_notebook_config.py
$ sudo nano /Users/drew/.jupyter/jupyter_notebook_config.py
Searched and found (line 155):
# c.NotebookApp.iopub_data_rate_limit = 1000000
So I just added a line below it without the comment char, and with one more zero:
c.NotebookApp.iopub_data_rate_limit = 10000000
That appears to work; I don't get the warning now.
df1 = biopy_fastqs['Galaxy2-[adrenal_1]']['df']
df1.shape
df1.iloc[:2, 1].as_matrix()
What I'm looking for here is a quick way to iterate over the rows of this series, and create a new array with the same index of each row as a new row, etc. Or else, split them out into individual arrays, then perform a transpose operation and recombine them. A good start is just to check out the numpy array methods available.
Reshape might be suitable, maybe while specifying 'order="F"'? The problem there is that I don't know how to specify the 'newshape' arg. The array should have a shape of 50121 rows, 1 column, and each element in each cell is another array with shape (1, 50). I guess it's actually listed '(50,)'.
Ok, I think I've got it now; vstack then transpose seems to work. The problem appears to be that having a numpy array where each element within is itself an array is not the same as having a multidimensional numpy array. That is, nested arrays are not quite the same thing as a contiguous array. I guess.
See below for a demonstration of how Python interprets some of these structures. I don't fully understand it myself, but I knew that there must be a builtin method to flatten and then rearrange the data without using a 'for' loop.
q1 = df1.iloc[:, 1]
print('1\t', type(q1), q1.shape)
print('2\t', type(q1), q1.shape)
print('3\t', type(np.vstack(q1)), np.vstack(q1).shape, '\n')
print('4\t', type(q1.as_matrix()), q1.as_matrix().shape)
print('5\t', type(np.vstack(q1.as_matrix())), '\n')
print('6\t', q1.as_matrix()[:2], '\n')
print('7\t', np.vstack(q1.as_matrix()[:2]), '\n')
print('8\t', np.vstack(q1.as_matrix()[:2]).T[:5], '\n')
print('9\t', type(q1.as_matrix()[0]), q1.as_matrix()[0].shape)
print('10\t', type(q1[0]), q1[0].shape)
# The line below yields AttributeError:
# 'numpy.ndarray' object has no attribute 'as_matrix'
# print('10\t', type(np.vstack(q1.as_matrix()[0])), np.vstack(q1[0].as_matrix()[0]).shape)
type(q1)
So if I can get them in that format, the question now is the best way to build that into the existing function or add a new function. I suppose that ultimately the best way to handle this is to make it a class, but I'm still not that up on building those.
The 'per_base' scores are attributes of the entire FastQ file, not any particular sequence within it, so I'll put them in the higher-level dict first, then copy to the df. The ones I might care about are 'per_base_quality', 'per_base_sequence_content', and 'per_base_N_content'.
...Actually, that might not work well. The higher-order dict entries are all distinct entities with no obvious 'round-these-all-up-in-numpy' function, so I might have to filter them through pandas anyway to get an 'as_matrix()' kind of functionality. Note: there probably should be a way to pass the sequence elements from the dict iteratively into a new array (perhaps using the fromiter method) before breaking them out into a single multidimensional array and transposing, but rather than learn about that I'll just try building the dataframe first, then using that to get the combined arrays. The output should still be stored in the dict associated with each FastQ file, however; attempting to append these values to the pandas DF fails because the length of the DF's index doesn't match the length of the transposed array of 'per_base' scores.
For per_base_N_content, I want to take the vstack().T array and, per row in the array, output a count of just 'N' values. It sounds like the numpy.where() method might be best for that. Alternatively, maybe np.unique()?
As this post points out, the behavior of np.where() is, if the input array is 2D, as here, to output two arrays: the first contains a list of all rows where the condition is met, the second contains a list of all columns where the condition is met.
The two output arrays of np.where() should therefore be the exact same shape, a tuple of form '(x, )', where x is equal to the exact count of how many times the condition was met.
From that understanding, there's really no need to convert the sequence array into a per-base-seq array using the whole vstack().T approach. It should suffice to take the sequence of an entire FastQ file as an array, where each row is a read and each column is a position from 0 to 50 within each read, feed it through np.where(), take the second array of its output, which gives column values, then count how often each int appears, and plot that.
Actually, the approach listed in the bullet point immediately above sounds unnecessarily complex. It should be easier to combine 'np.vstack()' with 'np.unique( , return_counts=True, axis=1)'... but for some reason that keeps the elements all separate; it outputs in an unexpected format. I see now that I was using 'np.nditer()', and by taking a closer look at these docs, I see that that method is meant not to iterate over just rows within an array, but over individual values. It looks like numpy is still meant to use for loops to iterate over entire rows within an array.
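For concreteness, the column-index idea from np.where described above can be sketched on a toy array (the real arrays are orders of magnitude larger):

```python
import numpy as np

# Toy stand-in for a reads-as-rows array (real data would be ~8e7 x 50)
reads = np.array([list('ACGTN'),
                  list('ACGNN'),
                  list('NCGTA')])

# np.where on a 2D boolean mask returns (row_indices, col_indices);
# the column indices say *where in the read* each 'N' occurred
rows, cols = np.where(reads == 'N')

# np.bincount then counts how often each column position appears,
# giving a per-position N count across all reads
n_per_position = np.bincount(cols, minlength=reads.shape[1])
print(n_per_position)  # -> [1 0 0 1 2]
```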
For per_base_sequence_content, I'll probably want a variant of the same output: the where() method might work.
For per_base_quality, I can just leave the scores as is, then call a box-and-whiskers plot on them later.
The only 'per_sequence' score that I might want is probably 'per_sequence_quality' (average quality per read).
Actually, for each FastQ file, I'd also like to know how many duplicates are present; that seems like potentially useful info. Since these files are RNAseq data, and may represent truly useful info about sequence abundance in the original sample, I probably won't end up deleting duplicates, but it still seems like important info to track. The interpretation of the equivalent example from FastQC is a tad nuanced, but I think I can get through it.
The raw value_counts are not super graph-friendly, because of how large the number of one-offs is; I could try log-log scales (or semi-log), but FastQC's choice of bins of 1-10, 50, 500, and 5,000 counts might be a good format to copy. As such, the pandas.cut() method would be a convenient way to apply them.
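A quick sketch of what that pd.cut binning could look like; the bin edges and labels below are my own approximation of the FastQC-style bins mentioned above, and the counts are invented:

```python
import pandas as pd

# Hypothetical per-sequence occurrence counts (how many times each
# distinct read appeared in the file)
dup_counts = pd.Series([1, 1, 1, 2, 3, 12, 75, 600, 7000])

# FastQC-style duplication bins; these edges are an assumption:
# 1-10 individually, then >10, >50, >500, >5k
bins = list(range(0, 10)) + [10, 50, 500, 5000, float('inf')]
labels = [str(i) for i in range(1, 10)] + ['10', '>10', '>50', '>500', '>5k']
binned = pd.cut(dup_counts, bins=bins, labels=labels).value_counts().sort_index()
print(binned)
```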
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict
start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'
def parse_fastq(start_dir):
    biopy_fastqs = {}
    phred_scale = 39  # Assuming Illumina data; change if diff qual score scale
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
            name = file.split(sep='.')[0] + ']'
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': file_path,
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                biopy_fastqs[name]['seqs'].append(
                    {'id': record.id,
                     'sequence': np.array(list(str(record.seq))),
                     'seq_whole': str(record.seq),
                     'quality': np.array(record.letter_annotations['phred_quality'])})
            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
            biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality']).T
            biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence']).T
            # Count base calls to dicts; one for raw counts, one for percentages:
            biopy_fastqs[name]['per_base_counts'] = []
            biopy_fastqs[name]['per_base_percents'] = []
            for row in biopy_fastqs[name]['per_base_seq']:
                unique, counts = np.unique(row, return_counts=True)
                biopy_fastqs[name]['per_base_counts'].append(dict(zip(unique, counts)))
            for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
                biopy_fastqs[name]['per_base_percents'].append({})
                for k in dictionary.keys():
                    biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
            # Let's put the counts into a 'per_base_df':
            biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_counts']).fillna(value=0)
            biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_percents']).fillna(value=0)
            biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
                biopy_fastqs[name]['per_base_qual']).fillna(value=0)
            # add values for duplication level
            biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size() \
                .value_counts().sort_index()
            # add mean per_seq_qual
            biopy_fastqs[name]['mean_seq_qual'] = pd.cut(
                biopy_fastqs[name]['df']['quality'].apply(np.mean),
                range(1, 41)).value_counts().sort_index()
            biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)
    return biopy_fastqs
biopy_fastqs = parse_fastq(start_dir_PC)
# biopy_fastqs = parse_fastq(start_dir_Mac)
for x in biopy_fastqs:
    print(x)
print('\n')
for x in biopy_fastqs['Galaxy2-[adrenal_1]']:
    print(x, type(biopy_fastqs['Galaxy2-[adrenal_1]'][x]))
# apply(np.mean) gives the mean quality per read (one value per sequence)
per_seq_qual = biopy_fastqs['Galaxy2-[adrenal_1]']['df']['quality'].apply(np.mean)
plt.hist(per_seq_qual)
plt.xlim([0, 40])
plt.show()
biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_seq_df'].head()
biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_percents_df'].head()
import matplotlib.pyplot as plt
biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_percents_df'].plot()
plt.ylim([0, 1])
plt.show()
Now, put them together into a function. I think I'm going to need this info.
import matplotlib.pyplot as plt
def plot_fastq(biopy_fastqs):
    batches = int(np.ceil(len(biopy_fastqs) / 4))
    plt.close('all')
    plt.style.use('ggplot')
    my_dpi = 96
    fig_scale = 500
    # put at most 4 fastq files per row; each batch of 4 gets 5 rows of axes
    fig, axs = plt.subplots(nrows=5*batches, ncols=4,
                            figsize=(4*fig_scale/my_dpi, 4*batches*fig_scale/my_dpi),
                            dpi=my_dpi)
    for i, fastq_name in enumerate(sorted(biopy_fastqs)):
        batch = i // 4
        row = 5 * batch  # first axes row for this batch
        col = i % 4
        axs[row, col].set_title(fastq_name, fontsize=fig_scale/25)
        axs[row, 0].set_ylabel('Per base \nQuality', fontsize=fig_scale/25)
        axs[row+1, 0].set_ylabel('Per base \nSequence \nContent', fontsize=fig_scale/25)
        axs[row+2, 0].set_ylabel('Per base \nN Content', fontsize=fig_scale/25)
        axs[row+3, 0].set_ylabel('Per Sequence \nMean Quality\nScore', fontsize=fig_scale/25)
        axs[row+4, 0].set_ylabel('# of Seq. vs. \nDuplication \nLevel', fontsize=fig_scale/25)
        biopy_fastqs[fastq_name]['per_base_qual_df'].T.plot(kind='box',
                                                            showfliers=False,
                                                            ax=axs[row, col],
                                                            sharey=True,
                                                            ylim=[0, 40])
        biopy_fastqs[fastq_name]['per_base_percents_df'].plot(kind='line',
                                                              ax=axs[row+1, col],
                                                              sharey=True,
                                                              ylim=[0, 1])
        biopy_fastqs[fastq_name]['per_base_percents_df']['N'].plot(kind='line',
                                                                   ax=axs[row+2, col],
                                                                   sharey=True,
                                                                   ylim=[0, 0.1])
        biopy_fastqs[fastq_name]['mean_seq_qual'].plot(kind='line',
                                                       ax=axs[row+3, col],
                                                       sharey=True)
        axs[row+3, col].set_ylim(top=12000)
        # axs[row+3, col].set_xticklabels(list(range(1, 41)))
        # axs[row+3, col].set_yscale('log')
        biopy_fastqs[fastq_name]['duplicates'].plot(kind='line',
                                                    ax=axs[row+4, col],
                                                    sharey=True)
        axs[row+4, col].set_xscale('log')
        axs[row+4, col].set_yscale('log')
        axs[row+4, col].set_ylim(bottom=0.5)
    plt.subplots_adjust(wspace=0.05, hspace=0.15)
    plt.show()
plot_fastq(biopy_fastqs)
I think that those results look pretty similar to the FastQC results, but just to be sure that I preserved the right relationships when applying the operations to calculate the per_base scores, I'd like to compare the summary statistics of the per_base_qual from BioPython to those output from the FastQC_data files:
# This cell to set-up comparison of per_base_qual stats
# derived from FastQC software to those from my biopython wrangling
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict
start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'
def parse_fastqc_data(start_dir):
    fastqc_data = defaultdict()
    for root, subdirs, files in os.walk(start_dir):
        for file in files:
            if file == 'fastqc_data.txt':
                name = os.path.split(root)[-1].split(sep='.')[0] + ']'
                fastqc_data[name] = {}
                with open(os.path.join(root, file)) as f:
                    # Summary data for per_base_quality seems to be on lines 14-63
                    lines = [line.rstrip('\n') for line in f]
                fastqc_data[name]['lines'] = []
                for line in lines[13:63]:
                    fastqc_data[name]['lines'].append(line.split(sep='\t'))
                fastqc_data[name]['compare'] = pd.DataFrame()
                fastqc_data[name]['compare']['biopy'] = \
                    biopy_fastqs[name]['per_base_qual_df'].T.mean(axis=0)
                fastqc_data[name]['compare']['fastqc'] = \
                    [float(stats[1]) for stats in fastqc_data[name]['lines'][:]]
                fastqc_data[name]['compare']['ratio'] = (
                    fastqc_data[name]['compare']['biopy']
                    / fastqc_data[name]['compare']['fastqc'])
            else:
                continue
    return fastqc_data

fastqc_data = parse_fastqc_data(start_dir_PC)
for name in fastqc_data:
    print(name, fastqc_data[name]['compare']['ratio'].describe())
Ok, that was a bit of overkill, but it shows definitively that the biopython-based wrangling of the data into per_base statistics gives the same result in all cases as does FastQC.
So at this point I've ensured that I can rearrange data from BioPython's SeqIO module that can gather the same summary stats as the FastQC tool, and I've gotten a decent overall view of the associated plots. The problem is that the individual plots are a little too small within the IPython/Jupyter Notebook window to get all of the essential data out, so I'd like to generate an interactive plot that can enlarge individual subplots upon mousing over or clicking. (As a more minor point, if > 4 fastq files are present in the host dir, you could end up with too many columns to read, no matter how big you make the window, so there should be a line to wrap the axes into columns of 4 for future use.)
This could end up being a bit too long an undertaking at this point; I may earmark it for future improvement. Best bets for interactive visualization at this point appear to be Bokeh, which sounds like it's basically a Python wrapper for the D3.js (javascript-based) library, along with the associated tool Holoviews, which sounds like it's mostly meant to simplify some of the gruntwork of setting up a visualization using tools like Bokeh.
Once the visualization is suitably handled, the next task is trimming. As you can see from this example, the Galaxy2 reads are all pretty high quality, but some of the other runs are pretty bad. In the past, after checking reads with FastQC, I've run them through the Trimmomatic tool to cut off parts of reads that don't reach a quality threshold. Another option, to which I was directed from a forum discussing BioPython, is sickle; it's not native Python, however, so it probably wouldn't be any easier to use from within Python than Trimmomatic.
Actually, first it might be best to throw out entire reads if they're poor quality; such as those with a mean qual below, say, 30? Then, trim individual bases by quality score from the reads that pass that filter. When trimming, you obviously don't want to just delete the low-quality bases, since that may end up concatenating better bases on either side. Rather, delete the offending base and all those 5' or 3' to it, depending on whether you're trimming from the 5' or 3' end, both of which tend to have a higher proportion of low quality scores, while the middle of the read is generally better. I could define these as two different functions, one that reads progressively more 3' up to base #25, and one that starts at the 3' end and reads upstream until the midpoint of the read, as well.
Finally, since tiny reads are likely to map to multiple reference sequences ambiguously, I'd want to throw out any reads that were too short after trimming for quality. Say, <35bp in length.
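The filter-then-trim idea sketched in the last two paragraphs could look something like this. The function name, the per-base cutoff of 20, and the simplification of trimming leading/trailing runs of low-quality bases (rather than the to-base-25 / to-midpoint scheme described above) are all my own choices; the mean-quality cutoff of 30 and minimum length of 35bp come from the text:

```python
import numpy as np

def filter_and_trim(seq, quals, mean_cutoff=30, base_cutoff=20, min_len=35):
    """Sketch of a read-level quality filter plus end-trimming.
    Returns (trimmed_seq, trimmed_quals), or None if the read fails."""
    quals = np.asarray(quals)
    # 1) throw out whole reads whose mean quality is too low
    if quals.mean() < mean_cutoff:
        return None
    # 2) trim the 5' end: drop the leading run of low-quality bases
    start = 0
    while start < len(quals) and quals[start] < base_cutoff:
        start += 1
    # 3) trim the 3' end similarly, walking upstream from the end
    end = len(quals)
    while end > start and quals[end - 1] < base_cutoff:
        end -= 1
    # 4) discard reads now too short to map unambiguously
    if end - start < min_len:
        return None
    return seq[start:end], list(quals[start:end])
```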
When it comes to deciding upon quality cutoffs, length cutoffs, and other such important scores to decide in the trimming process, I find Biostars to be the website with the most useful discussions.
Actually, there's one more factor that I've been overlooking: these reads are from a paired-end dataset. I don't have any easy way to handle that in BioPython right now. Maybe I should stop re-inventing the wheel with this analysis and just bring in some other bioinformatics packages.
Trimmomatic expects paired reads to be fed in as separate files. Though I've been ignoring the biological details of how these data sets were derived, it appears that these four files contain two different sets of paired-end reads from two separate libraries: the identifiers of the first few reads in Galaxy2 match those in Galaxy3; Galaxy2's have the '/1' suffix and Galaxy3's have the '/2', so the order is what you'd expect.
Therefore, I can process these with Trimmomatic directly. I could write a Python-based pipeline to execute this, but that might be awkward given the need to switch to the Unix shell. Eh, I'll try it.
Note: on the PC, I tried 'sudo apt-get install trimmomatic' in the Ubuntu shell window, and got:
Selecting previously unselected package trimmomatic.
(Reading database ... 27039 files and directories currently installed.)
Preparing to unpack .../trimmomatic_0.35+dfsg-1_all.deb ...
Unpacking trimmomatic (0.35+dfsg-1) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up trimmomatic (0.35+dfsg-1) ...
as output. Checking the expected (default) install dir, I find the executable at:
/usr/bin/TrimmomaticPE
Example input for PE:
java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
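A command like the one above can be launched from Python with subprocess, which sidesteps some of the shell-switching awkwardness. Everything below other than the Trimmomatic argument syntax itself is a placeholder (jar path, input/output filenames):

```python
import subprocess

# Placeholder paths; swap in real file locations before running
trim_jar = 'trimmomatic-0.35.jar'
cmd = ['java', '-jar', trim_jar, 'PE', '-phred33',
       'input_forward.fq.gz', 'input_reverse.fq.gz',
       'output_forward_paired.fq.gz', 'output_forward_unpaired.fq.gz',
       'output_reverse_paired.fq.gz', 'output_reverse_unpaired.fq.gz',
       'ILLUMINACLIP:TruSeq3-PE.fa:2:30:10',
       'LEADING:3', 'TRAILING:3', 'SLIDINGWINDOW:4:15', 'MINLEN:36']
# check=True would raise CalledProcessError if Trimmomatic exits non-zero;
# left commented out since the placeholder files don't exist
# subprocess.run(cmd, check=True)
print(' '.join(cmd))
```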
import sys, os
def trimmomatic_paired(biopy_fastqs, mac_pc='mac', **kwargs):
# Set the path for trimmomatic based on which laptop I'm using
rand_entry = list(biopy_fastqs.keys())[0]
if mac_pc == 'mac':
trim_path = ('/Users/drew/Documents/Data/Bio/'
'Trimmomatic-0.36/trimmomatic-0.36.jar')
commands = []
wd = os.path.dirname(biopy_fastqs[rand_entry]['path'])
sep = ';'
elif mac_pc == 'pc':
trim_path = (r'C:\Users\DMacKellar\Documents\Python'
r'\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar')
commands = []
wd = os.path.dirname(biopy_fastqs[rand_entry]['path'])
sep = ' & '
else:
print('Please specify "mac_pc" arg as "mac" or "pc"')
sys.exit(1) # abort the function
adapter_path = os.path.join(os.path.dirname(trim_path), 'adapters')
# Confirm that the file exists
if not os.path.isfile(trim_path):
print('trimmomatic.jar file not found at expected location')
sys.exit(1) # abort the function
# Confirm that the input dict
# contains an even number of fastq files
if len(biopy_fastqs) % 2 != 0:
print('Input has odd number of fastq files;'
'\nPlease enter paired end data.')
sys.exit(1) # abort the function
use_kwargs = {}
# Specify valid kwarg keys and default values
default_kwargs = {
'phred': '-phred33',
'ILLUMINACLIP': '',
'LEADING': 'LEADING:25',
'TRAILING': 'TRAILING:25',
'SLIDINGWINDOW': '',
'MINLEN': 'MINLEN:35'
}
# For any input kwargs, pass them to the use_kwargs dict
for key, value in kwargs.items():
if key in default_kwargs:
use_kwargs[key] = value
if (key == 'ILLUMINACLIP') and (value != ''):
illclip_split = value.split(sep=':')
adapter_file = os.path.join(adapter_path,
illclip_split[1])
if os.path.isfile(adapter_file):
illclip_join = ':'.join((illclip_split[0],
adapter_file,
*illclip_split[2:]))
use_kwargs['ILLUMINACLIP'] = illclip_join
else:
print('Adapter file {0} not found.'.format(adapter_file))
sys.exit(1) # abort the function
else:
print('kwarg key, value pair: {0}: {1} not understood; '
'please specify key in a case-sensitive manner. '
'Default kwargs are: {2}'.format(key,
value,
default_kwargs))
# Initialize some default settings in case none specified
for key, value in default_kwargs.items():
if key not in use_kwargs:
use_kwargs[key] = value
print('kwarg key, value pairs used: ', use_kwargs)
# Format and output the relevant commands
fastq_files = [x for x in sorted(biopy_fastqs)]
fastq_paths = [biopy_fastqs[x]['path']
for x in sorted(biopy_fastqs)]
forwards = fastq_files[0::2]
forward_paths = fastq_paths[0::2]
reverses = fastq_files[1::2]
reverse_paths = fastq_paths[1::2]
for f, f_p, r, r_p in zip(forwards, forward_paths, reverses, reverse_paths):
trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'],
use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'],
use_kwargs['MINLEN']]
os.chdir(wd)
to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % f,
'%s_trimmed_unpaired.fastq' % f,
'%s_trimmed_paired.fastq' % r,
'%s_trimmed_unpaired.fastq' % r]
for i, thing in enumerate(to_insert):
trim_commands.insert(3+i, thing)
trim_commands = ' '.join(trim_commands)
commands.append(trim_commands)
commands = sep.join(commands)
return commands
commands = trimmomatic_paired(biopy_fastqs, mac_pc='mac')
commands
I copied the command above to the terminal on my Mac. It ran quickly, and output:
TrimmomaticPE: Started with arguments:
/Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger /Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger Galaxy2-[adrenal_1]_trimmed_paired.fastq Galaxy2-[adrenal_1]_trimmed_unpaired.fastq Galaxy3-[adrenal_2]_trimmed_paired.fastq Galaxy3-[adrenal_2]_trimmed_unpaired.fastq -phred33 LEADING:25 TRAILING:25 MINLEN:35
Multiple cores found: Using 2 threads
Input Read Pairs: 50121 Both Surviving: 45162 (90.11%) Forward Only Surviving: 3074 (6.13%) Reverse Only Surviving: 1163 (2.32%) Dropped: 722 (1.44%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
/Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy4-[brain_1.fastq].fastqsanger /Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy5-[brain_2.fastq].fastqsanger Galaxy4-[brain_1]_trimmed_paired.fastq Galaxy4-[brain_1]_trimmed_unpaired.fastq Galaxy5-[brain_2]_trimmed_paired.fastq Galaxy5-[brain_2]_trimmed_unpaired.fastq -phred33 LEADING:25 TRAILING:25 MINLEN:35
Multiple cores found: Using 2 threads
Input Read Pairs: 37992 Both Surviving: 29529 (77.72%) Forward Only Surviving: 2097 (5.52%) Reverse Only Surviving: 5045 (13.28%) Dropped: 1321 (3.48%)
TrimmomaticPE: Completed successfully
When I try copy-paste-running the above on the PC, however, I get the error:
Error: Invalid or corrupt jarfile /usr/bin/TrimmomaticPE
After Googling, I came across a post from my old collaborator Tony Bolger, saying that I need to invoke the main class via its full package path (something resembling a partial URL of the source from which the program was obtained; I seem to remember seeing this in my notes while I was at Harvard):
java -classpath <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE
Unfortunately, it seems that the executable at '/usr/bin/TrimmomaticPE' isn't the right file to pass in place of the '<path to trimmomatic jar>' placeholder in the command above. Since I used 'apt-get install trimmomatic' to get the program on my PC, I didn't know the actual location where that jar file could be found. So I went to root and used find:
cd /
sudo find . -name "*trimmomatic*jar"
That took a while to run because it scanned through all of the Windows-specific dirs, outputting 'permission denied' for tons of dirs, because the Ubuntu-specific password doesn't work for the Windows-controlled sectors of the hard drive. Anyways, it eventually found the jar file in:
./usr/share/java/trimmomatic-0.35.jar
But when I tried running that, it wanted the commands in a sufficiently different format that it would've been a pain in the ass to reprogram the whole function. Instead, I just chose to uninstall the apt-get package and grab the zip file from the same source as on the Mac. I put the zip file under the home dir in the Ubuntu structure; the jar will be at '~/Documents/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar'. Now the command formatting should be more compatible.
...Aaaand, the hell with that. Running this through the Ubuntu interface requires reformatting all of the 'path' values in the dict, from Windows-friendly to linux-friendly. I could do that with a couple of str.replace() calls, but realized that Trimmomatic should work just fine under native windows, too. Let's try that, instead. I grabbed the zip file from the same source as on the Mac. I put the zip file in my Documents structure; the jar will be at 'C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar'.
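For the record, the Windows-to-Linux path reformatting I decided against would indeed only take a couple of str.replace()-style calls. A rough sketch (the '/mnt/c' mount point is an assumption about how the Ubuntu-on-Windows shell exposes the Windows drives, and the helper name is mine):

```python
def win_to_wsl(path):
    # Convert a Windows path to the equivalent path under the Ubuntu shell,
    # assuming drives are mounted at /mnt/<letter> as in WSL
    drive, rest = path.split(':', 1)
    return '/mnt/' + drive.lower() + rest.replace('\\', '/')
```

So e.g. the Galaxy_rnaseq dir becomes '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq' under that assumption.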
Ok. After a lot of fussing, I finally got the function to format the output correctly to function on the PC. Of course, when I go back to the Mac, I'll have to make sure that I didn't screw anything up too badly and it still functions there.
Not surprisingly, now that I reconsider the code, when I ran on the Mac, the output files went to /Users/drew/Documents/Data/Bio/FastQC, since that was the working dir of the terminal when I ran the commands. I had the function above change the working dir to that in which the input fastq files were found, but since I had written the function to just write out commands rather than executing them, changing the dir within python had no effect.
A better idea would be to run the function within the Python script, using the subprocess module. Anyways, for now, I'll point the fastq parser and plotting functions at those files and visualize the output.
...That returns:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
presumably because the 'np.vstack().T' calls within the parse_fastq() function expect arrays all of the same size. Since Trimmomatic and other trimming tools cut out sequence content on the basis of individual base-call quality scores, the reads now span a distribution of many different lengths. I'll have to modify the function to pad the reads with 'NaN' values so they're all the same length. The padding should be added only to the 3' end of each read, used only for manipulation within numpy, and removed again if I want to output the reads for any further processing outside of Python.
It sounds like 'np.pad()' is the way to do this, but I'm having a bit of trouble getting it to do exactly what I want: namely, pad variable-length rows to a common length, pad with the value 'np.nan', and only add to the right-hand side. This post might have some insight to offer.
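For one array at a time, np.pad() does behave as needed: pad only on the right, out to a target length, with NaN as the fill value. A minimal sketch (note the array must be float, since NaN has no integer representation; the example values are made up):

```python
import numpy as np

quality = np.array([30., 28., 25.])  # a trimmed read's quality scores
read_length = 5                      # the original, pre-trim read length
# Pad with 0 values on the left and (read_length - len) NaNs on the right
padded = np.pad(quality, (0, read_length - len(quality)),
                mode='constant', constant_values=np.nan)
```

The trouble described above is applying this across many rows of differing lengths in one call, since pad_width is fixed per call.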
Note: in the course of trying to alter the parse_fastq() function to address this issue, I commented out the lines using the np.vstack() method and any subsequent references to 'per_base' metrics, then ran the function. Surprisingly, on the PC this returns the aforementioned IOPub data rate error seen on the Mac. I'll try to address it using the same modification of the Jupyter config file as I did then.
When I try:
jupyter notebook --generate-config
on the PC, however, Windows opens a window asking what program I want to use to open this file. In other words, the file already existed. I chose Notepad++ to handle it, and the window indicates that the file opened is:
C:\Python\Lib\site-packages\jupyter.py
Its contents prior to modification are minimal:
"""Launch the root jupyter command"""<br />
if __name__ == '__main__':<br />
from jupyter_core.command import main<br />
main()<br /><br />
I saved the unmodified config file as:
C:\Python\Lib\site-packages\jupyter_bkup.py
I added the line:
c.NotebookApp.iopub_data_rate_limit = 10000000
at line 6, saved, and will reload the notebook server. The notebook reloaded fine, but still returned the iopub data rate warning. It seems that the two most likely explanations are that the alteration didn't take effect or that the increase in the data rate was insufficient. To test the first possibility, I'll reload the notebook using the argument from the command line, as the warning suggests, rather than relying upon the config file modification:
jupyter-notebook --NotebookApp.iopub_data_rate_limit=10000000
That does cause the cell to complete running without returning an error/warning, so it appears that the modification to the config file didn't have the desired effect. To confirm, however, I reloaded the notebook server without specifying the command-line expansion of the data rate limit, and this time it evaluated the cell without protest; so perhaps the first reload after editing the config file was an anomaly, not the rule. For now I'll leave this issue, but I should remember this episode if I encounter the warning again.
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict
start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'
def parse_fastq(start_dir):
biopy_fastqs = {}
phred_scale = 39 # Assuming Illumina data; change if diff qual score scale
for file in os.listdir(start_dir):
if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
name = file.split(sep='.')[0]
file_path = os.path.join(start_dir, file)
biopy_fastqs[name] = {}
biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
'seqs': []})
for record in SeqIO.parse(file_path, "fastq"):
biopy_fastqs[name]['seqs'].append(
{'id': record.id,
'sequence': np.array(list(str(record.seq))),
'seq_whole': str(record.seq),
'quality': np.array(record.letter_annotations['phred_quality'])})
biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
# biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality']).T
# biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence']).T
# Count base calls to dicts; one for raw counts, one for percentages:
biopy_fastqs[name]['per_base_counts'] = []
biopy_fastqs[name]['per_base_percents'] = []
# for i, row in enumerate(biopy_fastqs[name]['per_base_seq']):
# biopy_fastqs[name]['per_base_counts'].append(defaultdict())
# unique, counts = np.unique(row, return_counts=True)
# biopy_fastqs[name]['per_base_counts'][i] = dict(zip(unique, counts))
# for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
# biopy_fastqs[name]['per_base_percents'].append(defaultdict())
# for k in dictionary.keys():
# biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
# Let's put the counts into a 'per_base_df':
# biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
# biopy_fastqs[name]['per_base_counts']).fillna(value=0)
# biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
# biopy_fastqs[name]['per_base_percents']).fillna(value=0)
# biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
# biopy_fastqs[name]['per_base_qual']).fillna(value=0)
# add values for duplication level
biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size()\
.value_counts().sort_index()
# add mean per_seq_qual
biopy_fastqs[name]['mean_seq_qual'] = pd.cut(biopy_fastqs[name]['df']['quality'] \
.apply(np.mean), range(1, 41)).value_counts().sort_index()
biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)
return biopy_fastqs
# biopy_fastqs = parse_fastq(start_dir_Mac)
biopy_fastqs = parse_fastq(start_dir_PC)
for x in biopy_fastqs:
print(x)
import matplotlib.pyplot as plt
for x in biopy_fastqs['Galaxy2-[adrenal_1']: print(x)
biopy_fastqs['Galaxy2-[adrenal_1']['df']['seq_whole'].str.len().plot(kind='hist', bins=20)
plt.show()
Ok, so now I've gotten numpy/pandas to load data with different-length reads. The 'np.vstack()' method doesn't work with these, but I can try to play around with the np.pad() method.
blah = biopy_fastqs['Galaxy2-[adrenal_1']['df']['seq_whole'].apply(np.vstack)
print('{}\n{}'.format(blah.shape, blah.T.shape))
Ok, so np.vstack() actually works on the 'seq_whole' data, but not on quality scores or 'sequence' itself. Presumably this is because every entry in the 'seq_whole' series is an array with the same shape as every other: they have 1 element, a string, although the strings themselves have different lengths. By comparison, the 'sequence' and 'quality' series have arrays of different lengths.
So this would appear to be exactly where np.pad() could help.
...But I can't get it to pad all the elements to a common width. It seems that the pad function only adds a set number of elements to each array it's given. So I'll have to use a for loop to achieve what I want.
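That loop can stay short: find the longest read, right-pad each row out to that width, then np.vstack() the now-uniform rows. A sketch with made-up quality values, using NaN as the pad as discussed:

```python
import numpy as np

ragged = [np.array([30., 31., 28.]),
          np.array([25., 27., 29., 30., 32.])]
width = max(len(row) for row in ragged)
# Right-pad every row to the common width, then stack into a 2-D array
stacked = np.vstack([np.pad(row, (0, width - len(row)),
                            mode='constant', constant_values=np.nan)
                     for row in ragged])
```

With the rows uniform, the '.T' transposition for per-base stats works again, and np.nanmean() and friends can ignore the padding.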
For some reason, whenever I call on pandas to describe the series of 'quality' data for galaxy2_trimmed_paired, it causes the notebook to hang up. Is this really beyond the memory capacity of the notebook, or my PC?
Calling the describe() method on up to 1000 rows works, although noticeably slower than for 100 rows. 10,000 rows doesn't seem like it will complete.
On the Mac, describing 100 rows takes about 161ms; 1,000 takes 12.4 seconds. Obviously this dataset is difficult for either laptop to work with.
NOTE: Since the parse_fastq() function below has been modified to interpret trimmed reads, and since trimmed reads aren't compatible with the plot_fastq() function I'd written, I had to 'pad' the sequence and quality scores, and store those as separate dict and Pandas DataFrame values.
The way that I have padded them is dependent upon the assumption that these are only meant for the interpretation of per-base and per-sequence stats within notebooks like this one, and that they won't be carried forward for writing new fastq files for use with other software in the analysis pipeline, which won't accept exotic values.
I padded the quality scores with 'np.nan' so that they wouldn't interfere with calculating stats about whole reads; this also avoids any ambiguity about whether a score came from the original quality values assigned by the sequencer or from the artificial padding added by this function for internal analysis. Since the base call 'N' is a valid output from the sequencer, though, to avoid any ambiguity about whether a base comes from that source or was introduced by padding here, I've chosen to alter the parse_fastq() function to pad sequences with the letter 'Z' instead.
If the output of any of these dicts is to be carried forward using the padded values introduced here, either the parse_fastq() function should be modified again, perhaps to substitute '0' and 'N' values for 'np.nan' and 'Z', respectively, or else a downstream function should convert these values prior to writing a new fastq file. Since I've kept the 'quality' and 'sequence' values in the dict separate from their respective 'quality_pad' and 'sequence_pad' values, however, I'm hoping this will prove unnecessary.
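If stripping rather than converting ever turns out to be the right move, it's simple, since 'Z' and NaN are unambiguous markers of padding; a hypothetical helper (name and signature are mine):

```python
import numpy as np

def unpad_read(sequence_pad, quality_pad):
    # Drop the artificial 'Z'/NaN padding added during parsing,
    # recovering the original trimmed sequence and quality arrays
    mask = sequence_pad != 'Z'
    return sequence_pad[mask], quality_pad[mask]
```

Because the padding was only ever appended to the 3' end, masking on 'Z' removes exactly the padded tail.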
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict
start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'
def parse_fastq(start_dir, read_length=50):
biopy_fastqs = {}
phred_scale = 39 # Assuming Illumina data; change if diff qual score scale
for file in os.listdir(start_dir):
if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
name = file.split(sep='.')[0]
file_path = os.path.join(start_dir, file)
biopy_fastqs[name] = {}
biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
'directory': start_dir,
'seqs': []})
for record in SeqIO.parse(file_path, "fastq"):
quality = np.array(record.letter_annotations['phred_quality'])
quality_pad = np.append(quality, [np.nan]*(read_length-len(quality)))
sequence = np.array(list(str(record.seq)))
sequence_pad = np.append(sequence, ['Z']*(read_length-len(sequence)))
biopy_fastqs[name]['seqs'].append({'id': record.id,
'sequence': sequence,
'sequence_pad': sequence_pad,
'seq_whole': str(record.seq),
'quality': quality,
'quality_pad': quality_pad,
'description' : record.description,
'name': record.name})
biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
# Can use the np.vstack() method on quality scores and sequences
# Once we've brought all axes up to the same, standard length
biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality_pad']).T
biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence_pad']).T
# Count base calls to dicts; one for raw counts, one for percentages:
biopy_fastqs[name]['per_base_counts'] = []
biopy_fastqs[name]['per_base_percents'] = []
for i, row in enumerate(biopy_fastqs[name]['per_base_seq']):
biopy_fastqs[name]['per_base_counts'].append(defaultdict())
unique, counts = np.unique(row, return_counts=True)
biopy_fastqs[name]['per_base_counts'][i] = dict(zip(unique, counts))
for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
biopy_fastqs[name]['per_base_percents'].append(defaultdict())
for k in dictionary.keys():
biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
# Let's put the counts into a 'per_base_df':
biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
biopy_fastqs[name]['per_base_counts'])#.fillna(value=0)
biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
biopy_fastqs[name]['per_base_percents'])#.fillna(value=0)
biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
biopy_fastqs[name]['per_base_qual'])#.fillna(value=0)
# add values for duplication level
biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size()\
.value_counts().sort_index()
# add mean per_seq_qual
biopy_fastqs[name]['mean_seq_qual'] = pd.cut(biopy_fastqs[name]['df']['quality'] \
.apply(np.mean), range(1, 41)).value_counts().sort_index()
biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)
print('Added fastq file {0}'.format(name))
print('')
print('Finished parsing directory.')
return biopy_fastqs
biopy_fastqs = parse_fastq(start_dir_PC)
# biopy_fastqs = parse_fastq(start_dir_Mac)
Note: Up to this point in this notebook, I've only read in FastQ files and used BioPython to summarize their quality and manipulate their data without writing anything back out to a file (aside from driving Trimmomatic, which handles writing FastQ files itself). When I tried to write out FastQ files below (after removing reads whose mean quality scores were below a threshold), I found that BioPython's SeqIO module is built around the assumption that any in-Python manipulations will be relatively short blocks of code, so that writing out stays simple because each record's (i.e., sequencing read's) data remains associated with all of its relevant parts via the SeqIO module.
Since I've adopted this whole approach of using a parse_fastq() function, I've created a complicated dict that loses that association (but gained the advantage that all of the relevant data associated with the sequence object have been broken out into more familiar Python types). To write back to FastQ files, I can think of three main options, each with specific pros & cons:
Add the SeqIO record as a specific object associated with each 'seq' within each 'dict' output by the parse_fastq() function. This is the simplest to implement, but I'm not sure if it will greatly inflate the memory usage associated with the dict, and the respective time it would take to process the data in downstream operations. I'm thinking it might be significant, since it'd be essentially duplicating much of the other info in the dict.
Use any data manipulation from the parse_fastq()-output dict to just generate a list of sequence IDs, then refer back to the original input FastQ file to grab those IDs and write them out to a new file. That's less simple than the first option to implement, but would probably be more memory-efficient. A problem is that it would lose any potential modifications I make to any particular sequence's data within Python. That hasn't really come up yet; I've trimmed sequences with Trimmomatic, and padded them back out just to be able to more easily get summary data, but if I ever wanted to save any trim/pad info in the future, I'd have to find another workaround.
Ignore the intricacies of using the SeqIO module to write out, and just make new files with a custom format along the lines indicated at the start of this notebook: FastQ files are essentially ['seq1_id', 'seq1', '+', 'seq1_qual'] repeated over however many lines are needed. I can certainly force that with the dicts I have (although some of the ID info was truncated, including the trailing '/1'/'/2' characters indicating which file holds which read). Alternatively, as a modification of this approach, I could use the SeqIO module by feeding the dict-filtered data for each read into new SeqIO 'record' objects, then write these out to files, in case any formatting intricacies I haven't noticed are important.
I'll try the last approach first; even though it's the most involved, it is the most flexible.
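Before going the SeqIO-record route, it's worth noting how little the raw-text variant of that third option would take; a minimal sketch (the function name and the Phred-offset handling are my own assumptions, and real files would want the encoding checked carefully):

```python
def fastq_record(read_id, sequence, qualities, offset=33):
    # Build the four FASTQ lines for one read; qualities are numeric
    # Phred scores, encoded as ASCII characters at the given offset
    qual_str = ''.join(chr(q + offset) for q in qualities)
    return '\n'.join(['@' + read_id, sequence, '+', qual_str])
```

Joining one such record per read with newlines and writing the result to disk would reproduce the basic format, minus whatever nuances SeqIO handles for free.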
from Bio import SeqIO
import datetime
import errno
import os
def make_subdirs(fastq_dict, out_dir_suffix=None):
# A function to create new subdirs, if needed,
# using datetime if no particular name is specified:
name_dir = {}
if out_dir_suffix is not None:
suffix = str(out_dir_suffix)
else:
suffix = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
# Get all of the directories
# associated with files in the dict
dirs = set()
for fastq_name in fastq_dict.keys():
dir_temp = fastq_dict[fastq_name]['directory']
dirs.add(dir_temp)
name_dir[fastq_name] = dir_temp
# Make name for new subdirs
for directory in dirs:
new_dir = os.path.join(directory, suffix)
# Associate FastQ files with new subdir name
for fastq_name in name_dir:
if name_dir[fastq_name] == directory:
name_dir[fastq_name] = new_dir
# Write the directories
try:
os.makedirs(new_dir)
print('Wrote {}'.format(new_dir))
except OSError as e:
if e.errno != errno.EEXIST:
raise
return name_dir
def write_fastq(fastq_dict, out_dir=None, out_dir_suffix=None):
# Initialize a dict to store assoc between
# FastQ files and the dirs they go to
name_dir = {}
# If user-specified outdir given
# (i.e., one dir to put all files in):
if out_dir is not None:
for fastq_name in fastq_dict:
name_dir[fastq_name] = out_dir
# If the dir doesn't already exist, make it
if not os.path.isdir(out_dir):
try:
os.makedirs(out_dir)
except OSError as e:
if e.errno != errno.EEXIST:
raise
# Otherwise (out_dir is None) make per-directory subdirs;
# out_dir_suffix, if given, names each new subdir:
else:
name_dir = make_subdirs(fastq_dict,
out_dir_suffix=out_dir_suffix)
# Now, iterate through the fastq files in the dict
# write them to new SeqRecord objects, and write to files
for fastq_name in sorted(fastq_dict):
records = []
df = fastq_dict[fastq_name]['df']
for index, row in df.iterrows():
record = SeqIO.SeqRecord(seq=row['seq_whole'],
id=row['description'],
letter_annotations={'phred_quality': row['quality']},
description='')
records.append(record)
filename = os.path.join(name_dir[fastq_name],
fastq_name+'.fastq')
with open(filename, 'w+') as f:
SeqIO.write(records, handle=f, format='fastq')
print('')
print('{0} written to {1}'.format(fastq_name, filename))
print('')
print('Finished writing dict to fastq.')
dummy_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\dummy'
# write_fastq(biopy_fastqs, out_dir=dummy_dir_PC)
# write_fastq(biopy_fastqs, out_dir_suffix='blah')
# write_fastq(biopy_fastqs)
Ok, after a LOT of fiddling, I got that to work the way I wanted. At first I had tried referring to the dict entries for each seq to populate the output files, but indexing nested dicts is tricky, and finally I found it easier to grab the entries from the Pandas DataFrame instead.
I noticed that in the output fastq files, the formatting looked pretty OK, except that the first entry was like this:
@ERR030881.107 <unknown description>
CGGATTTCAGCTACTGCAAGCTCAGTACCACAGCCTCAAGCTCGAATGTG
+
HH;HHHHHGHHHHHHHHHHGHDHEHHHHHEHHHHBHHFHHHHHHHHHD0F
Whereas in the original Galaxy3-[adrenal_2.fastq].fastqsanger file it was:
@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/2
CGGATTTCAGCTACTGCAAGCTCAGTACCACAGCCTCAAGCTCGAATGTG
+
HH;HHHHHGHHHHHHHHHHGHDHEHHHHHEHHHHBHHFHHHHHHHHHD0F
So it seems that the relevant parameter in which to save that additional info is the 'description' attribute for the SeqIO.SeqRecord object. Therefore I should be able to write that into the original parse_fastq() function and keep that info for each read (the info about the Illumina machine on which it was read is nonessential, but the info that keeps it paired with its opposite in another file is pretty relevant). I'll try making that change to the last instance of the parse_fastq() function above.
...That pretty much works, except that BioPython's behavior with FastQ files is apparently to parse the first line of each read so that the part before the space becomes the 'record.id' attribute, while the ENTIRE line becomes the 'record.description' attribute. There's an additional attribute it can populate from this row, called 'record.name', but that apparently ends up the same as 'record.id'. So if I asked the write_fastq() function to write 'name' AND 'description', it would just repeat the id within the same line. Instead, I'll just set the function to write the description as the name.
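That split is easy to mimic on a raw header line, which helps when reasoning about what ends up where; a sketch of the convention as I understand it from the behavior described above:

```python
header = '@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/2'
# id is the first whitespace-delimited token (minus the '@');
# description is the whole line after the '@'
description = header[1:]
record_id = description.split()[0]
```

Note that the pairing-relevant '/2' suffix lives only in the description under this scheme, which is exactly why it has to be preserved when writing back out.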
That worked, but the danged SeqIO still tags '<unknown description>' onto the end:
@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1 <unknown description>
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF
I'll modify the 'write_fastq()' function again to explicitly declare that the record.description should be None, or an empty string or something. Ok, None didn't work, but an empty string did. It didn't add any trailing whitespace to the line, and since a fastq file has limited formatting nuance, I would expect subsequent programs to treat it as legit.
Ok, I'm sure this could be improved upon, but that's not bad. I've finally hit the goal of automating the gathering and parsing of all fastq files in a dir, and plotting their summary characteristics in a manner that scales with the number of files. Actually, I'm noticing a flaw with the duplication level of the last file; the 'trimmed_unpaired' (last plot) doesn't have nearly as many reads, but the plotter still stretches the y axis to occupy the full space, even though I tried to specify that it should share the same y axis throughout the row. Eh, anyways, for now I'll skip running that down, and get down to optimizing trim parameters (I should probably add a function that summarizes read counts after trimming, and maybe build this into the trimmomatic caller), and run some small-scale alignments before submitting a big job to use the full dataset.
Actually, I just realized that I never applied a mean sequence quality cutoff prior to using Trimmomatic. I'll write a short function now to do that for all fastq files in a dict of parsed fastq files, so that you can get rid of any particularly low-quality reads, either before or after trimming based on per-base call quality scores.
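A sketch of that filter, operating on the 'df' entry that parse_fastq() builds for each file (the helper name and default threshold are my own choices):

```python
import numpy as np
import pandas as pd

def filter_by_mean_quality(df, min_mean_qual=20):
    # Keep only reads whose mean per-base quality meets the threshold;
    # the 'quality' column holds a numpy array of Phred scores per read
    keep = df['quality'].apply(np.mean) >= min_mean_qual
    print('Dropped {0} of {1} reads'.format((~keep).sum(), len(df)))
    return df[keep]
```

Wrapping this in a loop over the parsed-fastq dict, then handing the surviving reads to write_fastq(), would cover the before-or-after-trimming use cases mentioned above.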
Note: In order to get proper formatting for the output info about how many reads were dropped, I had to make extensive use of Python's str.format() functionality.
Also, in the course of making the write_fastq() function, I ended up deleting the trimmed fastq files in the source dir, so I'll regenerate those now using the trimmomatic_paired() function.
ACTUALLY, I just noticed from the plots above that the wonky ACGT percentages for the first 10 bases of each file are far too similar. Check out, specifically, how 'A' is elevated for positions ~2-5 and depressed for positions ~6-8 in all libraries, even after trimming. This strongly suggests some adapter sequences that didn't get trimmed. So I'll try adding parameters to the trimmomatic function and the corresponding call to look for these. The Trimmomatic docs state that 'TruSeq3' is the appropriate choice of FASTA input for all HiSeq and MiSeq runs, and the docs for the source reads state they were HiSeq.
But that doesn't find the adapters fasta file. I'll have to add more code to provide for the path to those files. These did indeed come with the install zip; on the PC they're at:
C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
...I tried adding that full path into the commands, but the problem is that the Trimmomatic tool interprets colons as separating important inputs within the ILLUMINACLIP command, so it ends up taking 'C:' as the fasta file with the adapter sequences. As a workaround, I'll have the program execute within the dir containing the adapters sequences.
import sys
import os
import subprocess
def trimmomatic_paired(fastq_dict, mac_pc='mac', subdir=True,
subdir_suffix=None, verbose=False, **kwargs):
# Set the path for trimmomatic based on which laptop I'm using
rand_entry = list(fastq_dict.keys())[0]
if mac_pc == 'mac':
trim_path = ('/Users/drew/Documents/Data/Bio/'
'Trimmomatic-0.36/trimmomatic-0.36.jar')
commands = []
sep = ';'
elif mac_pc == 'pc':
trim_path = (r'C:\Users\DMacKellar\Documents\Python'
r'\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar')
commands = []
sep = ' & '
else:
print('Please specify "mac_pc" arg as "mac" or "pc"')
sys.exit(1) # abort the function
adapter_path = os.path.join(os.path.dirname(trim_path), 'adapters')
os.chdir(adapter_path)
# Confirm that the trimmomatic jar file exists
if not os.path.isfile(trim_path):
print('trimmomatic.jar file not found at expected location')
sys.exit(1) # abort the function
# Confirm that the input dict
# contains an even number of fastq files
try:
len(fastq_dict) % 2 != 0
except:
print('Input has odd number of fastq files; \
\nPlease enter paired end data.')
sys.exit(1) # abort the function
use_kwargs = {}
# Specify valid kwarg keys and default values
default_kwargs = {
'phred': '-phred33',
'ILLUMINACLIP': '',
'LEADING': 'LEADING:25',
'TRAILING': 'TRAILING:25',
'SLIDINGWINDOW': '',
'MINLEN': 'MINLEN:30'
}
# For any input kwargs, pass them to the use_kwargs dict
for key, value in kwargs.items():
if key in default_kwargs:
use_kwargs[key] = value
# If adapter trimming requested,
# Confirm location of adapter fasta file
if (key=='ILLUMINACLIP') & (value != ''):
illclip_split = value.split(sep=':')
adapter_file = os.path.join(adapter_path,
illclip_split[1])
try:
os.path.isfile(adapter_file)
except OSError as err:
print('Adapter file {0} not found.'.format(adapter_file))
sys.exit(1) # abort the function
else:
print('kwarg key, value pair: {0}: {1} not understood;\
please specify key in a case-sensitive manner.\
Default kwargs are: {3}'.format(key,
value,
**default_kwargs))
# For any kwargs not fed to input, use defaults
for key, value in default_kwargs.items():
if key not in use_kwargs:
use_kwargs[key] = value
print('kwarg key, value pairs used: ', use_kwargs)
name_dir = {}
if subdir == True:
if subdir_suffix != None:
# Use user-defined subdir name
name_dir = make_subdirs(fastq_dict,
out_dir_suffix=str(subdir_suffix))
elif subdir_suffix == None:
# Make new subdir to put files in
name_dir = make_subdirs(fastq_dict,
out_dir_suffix='trim_L{0}_T{1}'
.format(use_kwargs['LEADING'].split(sep=':')[1],
use_kwargs['TRAILING'].split(sep=':')[1]))
elif subdir == False:
for fastq_name in fastq_dict:
name_dir[fastq_name] = fastq_dict[fastq_name]['directory']
# Format and output the relevant commands
fastq_paths = [fastq_dict[x]['path'] for x in sorted(fastq_dict)]
output_paths = [os.path.join(name_dir[x], x) for x in sorted(fastq_dict)]
forward_paths = fastq_paths[0::2]
reverse_paths = fastq_paths[1::2]
out_f_paths = output_paths[0::2]
out_r_paths = output_paths[1::2]
for f_p, r_p, f_outp, r_outp in zip(forward_paths, reverse_paths,
out_f_paths, out_r_paths):
trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'],
use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'],
use_kwargs['MINLEN']]
to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % f_outp,
'%s_trimmed_unpaired.fastq' % f_outp,
'%s_trimmed_paired.fastq' % r_outp,
'%s_trimmed_unpaired.fastq' % r_outp]
for i, thing in enumerate(to_insert):
trim_commands.insert(4+i, thing)
trim_commands = ' '.join(trim_commands)
commands.append(trim_commands)
commands = sep.join(commands)
stdout = subprocess.run(commands, stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, shell=True,
check=True)
if verbose==True:
for x in stdout.stdout.splitlines():
print(x.decode('utf-8'))
# return commands
return stdout
trim_kwargs = {'ILLUMINACLIP': 'ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10'}
stdout = trimmomatic_paired(biopy_fastqs,
mac_pc='pc',
verbose=True,
subdir=True,
**trim_kwargs)
trim_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25'
trim1_dict = parse_fastq(trim_dir)
Hmm... Now Trimmomatic is returning flawed sequences. I've re-run the trim module four times, and whenever I try to parse the output, BioPython returns an error; thrice it was about differing lengths for the respective sequence and quality scores for some read. The read in question varies every time, and checking them in the fastq file indeed reveals that BioPython is correct. One time the parser returned 'whitespace is not allowed in the sequence'. I really don't know why I'm seeing this; I didn't have this problem the first time I wrote the 'trimmomatic_paired' function.
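Before digging into the commands themselves, a quick way to confirm what BioPython is complaining about is to compare sequence and quality lengths in the raw 4-line FastQ records directly (a pure-Python sketch over hypothetical records):

```python
# Minimal check for the parse error BioPython reports: in FastQ, each
# record is 4 lines, and line 2 (sequence) and line 4 (quality) must
# have equal lengths.
def find_bad_records(lines):
    bad = []
    for i in range(0, len(lines) - 3, 4):
        seq = lines[i + 1].rstrip('\n')
        qual = lines[i + 3].rstrip('\n')
        if len(seq) != len(qual):
            bad.append(lines[i].rstrip('\n'))
    return bad

# Hypothetical records: the second has a truncated quality string
records = ['@ok\n', 'ACGT\n', '+\n', 'IIII\n',
           '@bad\n', 'ACGT\n', '+\n', 'III\n']
print(find_bad_records(records))  # ['@bad']
```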
I'll try running manually from the command line to see whether there's some wrinkle being introduced via running from the notebook:
# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar PE -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy2-[adrenal_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy3-[adrenal_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_forward_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_forward_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_reverse_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_reverse_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
Ok, yes: that does execute just fine when input manually to the command line. BUT if I use the function to prepare the commands as a string, then copy and paste that into the command line, the output also fails to parse. So there's something specific about the difference in the code prepared by the function, not the effect of just running it from within the notebook.
The commands aren't formatted EXACTLY the same way, however; the manual one I did above has slightly different names given to the output fastq files, and each path contains double-backslashes. I'll try re-formatting the input command for a manual submission to more closely match the function's output, to try to narrow down possible sources of this difference:
# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar PE -phred33 C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy2-[adrenal_1.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy3-[adrenal_2.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
Aha, now THAT generates an error when parsing the output. There's something about these filename paths that's somehow causing Trimmomatic to behave erratically.
# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:/Users/DMacKellar/Documents/Python/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_paired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_paired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
Well, getting rid of all the backslashes doesn't help; it's still an error upon parsing. I'll try just changing the filenames.
# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:/Users/DMacKellar/Documents/Python/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/a.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/b.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/c.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/d.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
That helps. But now I finally see the difference: it wasn't the path or filenames of the outputs above that mattered so much, it was the fact (that I didn't notice up until now) that running the function's output generates only two output files, both labeled 'Galaxy2...', instead of two each for Galaxy2 and Galaxy3. It seems that, when I modified the function above to generate output paths for the full dict, I didn't consider that it might have a different length than the input path lists.
Here's the code from the function (lines 95-117) that was causing the issue:
# Format and output the relevant commands
fastq_paths = [fastq_dict[x]['path'] for x in sorted(fastq_dict)]
output_paths = [os.path.join(name_dir[x], x) for x in sorted(fastq_dict)]
forward_paths = fastq_paths[0::2]
reverse_paths = fastq_paths[1::2]
for f_p, r_p, out in zip(forward_paths, reverse_paths, output_paths):
    trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
                     use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'],
                     use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'],
                     use_kwargs['MINLEN']]
    to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % out,
                 '%s_trimmed_unpaired.fastq' % out,
                 '%s_trimmed_paired.fastq' % out,
                 '%s_trimmed_unpaired.fastq' % out]
    for i, thing in enumerate(to_insert):
        trim_commands.insert(4+i, thing)
    trim_commands = ' '.join(trim_commands)
    commands.append(trim_commands)
commands = sep.join(commands)
When I add a print statement to get the length of each list being fed into the 'for f_p...' line, I get:
output_paths: 4 forward_paths: 2 reverse_paths: 2
So the function was writing 'Galaxy2..._trimmed_unpaired.fastq' out, then overwriting it with the output from trimming Galaxy3, and this was somehow screwing up the output in a way that made individual reads inconsistent. I'll fix that now, by subjecting the 'output_paths' variable to the same slicing that the 'fastq_paths' list gets.
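A toy reproduction of the mismatch, assuming two hypothetical pairs ('a' and 'b'): `zip()` stops at the shortest iterable, so the 4-element `output_paths` list was consumed one element per pair, and the second pair's outputs got based on an 'a' name.

```python
fastq_paths = ['a_1.fq', 'a_2.fq', 'b_1.fq', 'b_2.fq']
output_paths = ['out/a_1', 'out/a_2', 'out/b_1', 'out/b_2']
# The buggy pairing: one unsliced output list zipped against two sliced
# input lists
pairs_buggy = list(zip(fastq_paths[0::2], fastq_paths[1::2], output_paths))
print(pairs_buggy[1])  # ('b_1.fq', 'b_2.fq', 'out/a_2') -- wrong base name
# The fix: slice output_paths the same way as the inputs
pairs_fixed = list(zip(fastq_paths[0::2], fastq_paths[1::2],
                       output_paths[0::2], output_paths[1::2]))
print(pairs_fixed[1])  # ('b_1.fq', 'b_2.fq', 'out/b_1', 'out/b_2')
```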
Ok, that appears to have done the trick. Man, it's super annoying how long it can take to track down such a simple error when coding. Still, at least I was able to take the appropriate steps to keep things moving forward.
Anyways, the next step was going to be to filter the trimmed reads for average quality, but I'll skip that for now. This short post suggests that it's not very applicable in the preprocessing of sequence data.
BUT: The adapters are still there. Using the ILLUMINACLIP settings didn't change the per-base skew that's apparent when looking at the plots. I'm betting the Trimmomatic-supplied fasta files are out-of-date, or otherwise weren't the right ones for this dataset. After looking around online, I found a more comprehensive collection of contaminant sequences, and will try using that as an input file for trimmomatic. Unfortunately, the file's format isn't fasta, so I'll have to re-format it first.
import re

path1 = (r'C:\Users\DMacKellar\Dropbox\Coding'
         r'\Python\BioPython\idot_github_contaminant_list.txt')
path2 = (r'C:\Users\DMacKellar\Dropbox\Coding'
         r'\Python\BioPython\idot_github_contaminant_list.fasta')
path3 = (r'C:\Users\DMacKellar\Documents\Python'
         r'\BioPython\Trimmomatic-0.36\adapters'
         r'\idot_github_contaminant_list.fasta')
with open(path1, 'r') as f:
    lines = f.readlines()
pattern = re.compile(r'(\t)+')
illumina = []
for line in lines[17:160]:
    if line == '\n':
        continue
    # Prepend a '>' for the fasta header, replace spaces in the name,
    # and convert the run of tabs before each sequence into a newline
    newline = '>' + line.replace(' ', '_')
    newline = re.sub(pattern, '\n', newline)
    illumina.append(newline)
with open(path3, 'w') as f2:
    for line in illumina:
        f2.write(line)
illumina[:3]
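To make the conversion concrete, here's the transformation applied to one hypothetical contaminant-list line (tab-separated name and sequence is the assumed layout):

```python
import re

line = 'PCR Primer 1\t\tCAAGCAGAAGACGGCATACGA\n'  # hypothetical input line
newline = '>' + line.replace(' ', '_')
newline = re.sub(r'(\t)+', '\n', newline)
print(newline)
# >PCR_Primer_1
# CAAGCAGAAGACGGCATACGA
```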
trim_kwargs = {'ILLUMINACLIP':
'ILLUMINACLIP:idot_github_contaminant_list.fasta:2:30:7'}
stdout = trimmomatic_paired(biopy_fastqs,
mac_pc='pc',
verbose=True,
subdir=True,
subdir_suffix='contaminant_trim',
**trim_kwargs)
trim2_dir = r'C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/contaminant_trim/'
trim2_dict = parse_fastq(trim2_dir)
%%time
plot_fastq(trim2_dict, per_row=4)
import matplotlib.pyplot as plt
allyourbase = pd.DataFrame()
for x in trim2_dict:
    # trim2_dict[x]['per_base_percents_df'].A.plot(kind='line')
    plt.plot(trim2_dict[x]['per_base_percents_df'].A, label=x[:20])
    allyourbase = allyourbase.append(trim2_dict[x]['per_base_percents_df'].A)
plt.title('"A" Content per base\nin trimmed Reads')
plt.xlabel('Base position in reads')
plt.ylabel('%A')
ticks = allyourbase.sum().T.nlargest(3).index.tolist()
ticks.extend(range(20, 50, 10))
plt.xticks(ticks)
plt.grid(True, axis='x')
plt.legend()
plt.show()
Ok, that didn't get rid of the biased first few bases either. As far as I know, there's no compelling biological reason that the reads should all have A content that peaks at bases 3, 5, and 9 in every transcript. This is almost certainly a sequencing artifact.
Let's try a more direct approach: can I just look at a dozen or so sequences and see the commonalities in those first few bases?
biopy_fastqs['Galaxy2-[adrenal_1']['df']['sequence'][:10]
Yeah, I still don't see it. I don't know what's causing that base content skew. Maybe it's nothing to worry about, but it's a very noticeable regularity.
Maybe I should check the raw data from the experiment, rather than this curated source? I doubt that it's so different, but it's worth noting that the Illumina BodyMap 2.0 project generated far, far more data than this limited dataset. I can't be certain that other aspects of the data weren't altered.
As noted at the top of this notebook, the full set of reads are available on NCBI's SRA. To interact with that repository, however, they require you to download their special software; the reads aren't available via any kind of structured URL query.
Binaries are available for Linux, Mac OS X, and Windows. Instructions are here. I put the unzipped output on the Mac at:
/Users/drew/Documents/Data/Bio/sratoolkit
On PC at:
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin
Once those are installed, use the prefetch tool to download the sequences with the specific accession number. Or, perhaps that's unnecessary.
Then, you have to use their fastq-dump program to extract the FastQ (or FASTA) files from the downloaded SRA-format archives.
# (on Mac):
cd /Users/drew/Documents/Data/Python/Galaxy_rnaseq
curl ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/ERP000546
/Users/drew/Documents/Data/Bio/sratoolkit/bin/fastq-dump -X 5 -Z ERP000546
Hmm... those both return errors. Using the 'fastq-dump' command gives:
fastq-dump.2.8.2 err: item not found while constructing within virtual database module - the path 'ERP000546' cannot be opened as database or table
Using curl gives:
curl: (78) RETR response: 550
Which apparently means (~99% of the time) 'file not found'. So I'm having trouble figuring out how to download SRA data.
On PC, the trial code:
fastq-dump.exe -X 5 -Z SRR390728
does work, outputting the first 5 reads from that dataset. I didn't have to run the config for the tool. I'll try
cd C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\SRA_raw
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin\fastq-dump.exe -X 10000 -A C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\accession.txt
Nope, that gives:
2017-11-10T03:01:31 fastq-dump.2.8.2 err: item not found while constructing within virtual database module - the path '/C/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/accession.txt' cannot be opened as database or table
Apparently just feeding a text file of accessions won't work with the '-A' flag.
Maybe this is a job for prefetch?
cd C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\SRA_raw
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin\prefetch.exe --list C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\accession.txt
Nope; that wants a properly formatted 'kart file', and it sounds like the only way to get that is to go through the website. It sounds like they expect you to just enter multiple calls to prefetch or fastq-dump with one accession each. I guess that makes sense when many modern fastq raw read files are >1GB in size.
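Since the tools take one accession per call, the natural shape is to parse the accession list myself and loop; a minimal sketch (the one-accession-per-line layout of accession.txt is an assumption):

```python
def parse_accessions(text):
    # Assumes one run accession per line; blank lines skipped
    return [line.strip() for line in text.splitlines() if line.strip()]

# Build one fastq-dump command per accession (hypothetical accessions)
cmds = ['fastq-dump -X 10000 --split-3 {}'.format(acc)
        for acc in parse_accessions('SRR390728\nSRR390729\n')]
print(cmds[0])  # fastq-dump -X 10000 --split-3 SRR390728
```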
Note: you'll want to use the 'split-3' arg with fastq-dump in order to separate the F and R reads into separate files.
I think I'll try to do this via the subprocess module:
if o_s == 'Windows':
    exe_ext = '.exe'
else:
    exe_ext = ''
paths['fastq_dump'] = os.path.join(paths['sra_tools_dir'],
                                   'fastq-dump' + exe_ext)
paths['prefetch'] = os.path.join(paths['sra_tools_dir'],
                                 'prefetch' + exe_ext)
import subprocess
os.chdir(paths['data_dir'])
commands1 = '{} -X 5 -Z SRR390728'.format(paths['fastq_dump'])
commands2 = '{} -X 10MB SRR390728'.format(paths['prefetch'])
commands3 = [paths['fastq_dump'], '-X', '5', '-Z', 'SRR390729']
commands4 = [paths['fastq_dump'], '-X', '5', '-Z', 'SRR390729']
subprocess.run(commands4, stdout=subprocess.PIPE,
               stderr=subprocess.STDOUT, shell=False, check=False)
def sra_download(accession_list, *args, **kwargs):
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append(k)
        kwargs_new.append(str(v))
    # If a global 'out_dir' has been defined, download there;
    # otherwise stay in the current working directory
    try:
        out_dir
    except NameError:
        pass
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
        os.chdir(out_dir)
    print('Downloading {} files from the NCBI SRA server...'.format(
        len(accession_list)))
    for i, acc in enumerate(accession_list):
        cmds = [paths['fastq_dump'], *args, *kwargs_new, str(acc)]
        p = subprocess.Popen(cmds, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, shell=False)
        out, err = p.communicate()
        if len(err) > 0:
            print('\n{:>4} {} returned an error: \n{}\n'.format(
                i + 1, acc, err.decode('ascii')))
        else:
            print('{:>4} {} downloaded'.format(i + 1, acc))
os.chdir(os.path.join(paths['data_dir'], 'split3'))
args = ['--split-3']
kwargs = {'-X': '10000'}
acc1 = ['SRR390728']
sra_download(acc1, *args, **kwargs)
Ok, that's working. Now, for the purposes of trying out additional steps in the pipeline, and keeping everything lightweight-enough to keep in local memory, I'll just download the first 10,000 spots associated with each of the tissues in the BodyMap dataset.
%%time
# DCM Note: After running this once locally, comment out
# so we don't repeat the download whenever notebook is run
args = ['--split-3']
kwargs = {'-X': '10000'}
accession_list = sra_table['Run']
sra_download(accession_list, *args, **kwargs)
Now that I've downloaded a sample of the data sets, let's split them into the relevant experimental conditions: the single-end 75bp reads, the paired 50bp reads, and the single-end 100bp reads (this last, remember, comes from pooling the RNA from the various tissues, and is probably uninteresting for this analysis).
Then, I'll import the tissue-specific paired- and single-end reads with Biopython's SeqIO module.
from Bio import SeqIO
def bmap_reads_import(path=None):
    exps, reads = {}, {}
    bad = ['16 Tissues mixture', np.NaN]
    sra_tiss_df = sra_table[~sra_table.loc[:, 'organism_part'].isin(bad)]
    sra_tiss_df_s = sra_tiss_df.sort_values(by=['organism_part',
                                                'LibraryLayout'])
    conv_dict = {'_s': '', '_f': '_1', '_r': '_2'}
    for _, row in sra_tiss_df_s.iterrows():
        if row['LibraryLayout'] == 'SINGLE':
            key = row['organism_part'] + '_s'
            exps[key] = row['Run']
        else:
            exps[row['organism_part'] + '_f'] = row['Run']
            exps[row['organism_part'] + '_r'] = row['Run']
    for exp, run in exps.items():
        if exp not in reads.keys():
            f_name = run + conv_dict[exp[-2:]] + '.fastq'
            f_path = os.path.join(path, f_name)
            if os.path.isfile(f_path):
                with open(f_path, 'r') as f:
                    reads[exp] = SeqIO.to_dict(SeqIO.parse(f, 'fastq'))
    return reads
%%time
paths['split3'] = os.path.join(paths['data_dir'], 'split3')
bmap_reads = bmap_reads_import(path=paths['split3'])
# Let's just check one of the records
first_rec = bmap_reads['adipose_f']['ERR030880.1']
first_rec
Ok, now it's time to check the reads' lengths, aggregate quality scores, and various other attributes. Biopython stores the individual records as instances of their SeqRecord class, and the attribute carrying the PHRED quality scores for each base in the sequence are in SeqRecord.letter_annotations, which is a dict, in this case containing the sole key of phred_quality:
print(first_rec.description)
print(first_rec.letter_annotations['phred_quality'])
Let's try plotting each tissue type's quality score per base position. I think it would be best to break the quality scores out from their default organization as being values within a nested dict, to being a numpy array where the first axis are individual reads and the second axis is positions within each read. Then I'll output these to a separate dict where each key is the experiment, and the value is the associated array.
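The reorganization on a toy example (hypothetical read names and scores): per-read quality lists stacked into a reads x positions array, so axis-wise stats come for free.

```python
import numpy as np

quals = {'read1': [30, 31, 32], 'read2': [20, 21, 22]}  # toy data
arr = np.array(list(quals.values()))        # shape: (reads, positions)
per_position_mean = arr.mean(axis=0)        # [25., 26., 27.]
```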
def recover_seq_score(reads_dict):
    seq_dict, score_dict = {}, {}
    for exp, d in reads_dict.items():
        n_records = len(d)
        first_rec = d[list(d.keys())[0]]
        # Read length parsed from a description ending in 'length=NN'
        length = int(first_rec.description.split()[-1][7:])
        seq_array = np.empty((n_records, length), dtype=np.unicode_)
        score_array = np.empty((n_records, length), dtype=int)
        for i, record in zip(range(n_records), d.values()):
            seq_array[i] = np.asarray(record.seq, dtype=np.unicode_)
            score_array[i] = np.asarray(
                record.letter_annotations['phred_quality'])
        seq_dict[exp] = seq_array
        score_dict[exp] = score_array
    return seq_dict, score_dict
%%time
bmap_seq, bmap_phreds = recover_seq_score(bmap_reads)
first_seq = bmap_seq['adipose_f']
first_score = bmap_phreds['adipose_f']
print('adipose_f sequence array:\n{}\n\nadipose_f score array:\n{}'.format(
first_seq, first_score))
Now I'll need a function to plot the phred scores by position, with each tissue type in a subplot. I'll look at a few other quality measures, too, including sequence duplication level, per-base distribution of different nucleotide values, etc.
def plot_read_stats(seq_dict, score_dict):
    batches = int(np.ceil(len(seq_dict) / 3))
    plt.close('all')
    plt.style.use('ggplot')
    my_dpi = 96
    fig_scale = 500
    # put at most 3 experiments per row
    fig, axs = plt.subplots(nrows=5*batches, ncols=3, sharex=False,
                            figsize=(3*fig_scale/my_dpi,
                                     3*batches*fig_scale/my_dpi),
                            dpi=my_dpi)
    first = seq_dict[list(seq_dict.keys())[0]]
    labels = ['Per Base\nQuality', 'Per Base\nSequence\nContent',
              'Per Base\nN Content', 'Per Sequence\nMean Quality\nScore',
              '# of Seq\nDuplication\nLevel']
    ylims = [(0, 40), (0, 0.6), (0, 0.01),
             (0, first.shape[0]/3), (0, first.shape[0])]
    for ax in np.ndenumerate(axs):
        ind = ax[0][0] % 5
        ax[-1].set_ylim(ylims[ind])
        ax[-1].set_ylabel(labels[ind])
    for i, (exp, d1), d2 in zip(range(len(seq_dict)), seq_dict.items(),
                                score_dict.values()):
        batch = 5 * int(np.floor(i / 3))
        i2 = i % 3
        axs[batch, i2].set_title(exp, fontsize=fig_scale/25)
        n_s = (d1 == 'N').sum(axis=1)
        base_counts = {}
        for x in ['A', 'G', 'C', 'T', 'N']:
            base_counts[x] = (d1 == x).sum(axis=0) / np.count_nonzero(d1, axis=0)
        per_seq_means = d2.mean(axis=1)
        seq_means_counts, _ = np.histogram(per_seq_means, bins=range(0, 41))
        _, cnt = np.unique(d1, axis=0, return_counts=True)
        n_duplicates, _ = np.histogram(cnt, bins=range(11))
        axs[batch, i2].boxplot(d2, showfliers=False,
                               boxprops={'color': 'salmon'},
                               whiskerprops={'color': 'indianred'})
        axs[batch, i2].set_xlim(0, 75)
        axs[batch, i2].xaxis.set_ticks(np.arange(1, 75, 5))
        axs[batch, i2].xaxis.set_ticklabels(np.arange(1, 75, 1)[::5])
        for k, v in base_counts.items():
            axs[batch+1, i2].plot(range(d1.shape[1]), v, label=k)
        axs[batch+1, i2].legend(loc=1)
        axs[batch+2, i2].plot(range(d1.shape[1]), base_counts['N'])
        axs[batch+3, i2].plot(range(40), seq_means_counts)
        axs[batch+4, i2].plot(range(-1, 9), n_duplicates)
        axs[batch+4, i2].set_xlim(0, 10)
    plt.tight_layout()
    return axs
%%time
plot_read_stats(bmap_seq, bmap_phreds)
plt.show()
That's... surprisingly bad. And Illumina generated these data? Oh well.
Many, many of these reads are not usable. I think I'll want to discard any read whose mean quality score is below ~20, and trim the first dozen or so bases from every read, given how skewed the per-base distribution is at the start of the reads. They all share a very stereotypical pattern of spikes, which suggests untrimmed adapters or something similar.
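That filtering, sketched on a toy PHRED array (reads x positions; `scores` is a hypothetical stand-in for one of the arrays in bmap_phreds):

```python
import numpy as np

scores = np.array([[35] * 20, [10] * 20, [25] * 20])  # 3 toy reads of 20 bp
keep = scores.mean(axis=1) >= 20       # drop reads with mean quality < 20
trimmed = scores[keep][:, 12:]         # cut the first dozen positions
print(trimmed.shape)                   # (2, 8)
```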
In the past, I've mostly used Trimmomatic, but some other sources online have suggested that BBDuk is faster. The latter is part of the BBMap suite of tools; I downloaded it on the PC to the same dir as the other tools.
According to the BodyMap project's protocol info, the RNA was ligated with illumina small RNA v1.5 adaptors. That info should be used when trimming the reads.
illum_seq = {'Small RNA PCR Primer 1': 'CAAGCAGAAGACGGCATACGA',
             'Small RNA PCR Primer 2':
                 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA',
             'Small RNA Sequencing Primer':
                 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC',
             "v1.5 Small RNA 3' Adapter": 'ATCTCGTATGCCGTCTTCTGCTTG'}
paths['illum_seq_fasta'] = os.path.join(paths['data_dir'], 'illum_seq.fasta')
with open(paths['illum_seq_fasta'], 'w') as f:
    for name, seq in illum_seq.items():
        f.write('>{}\n'.format(name))
        f.write('{}\n'.format(seq))
with open(paths['illum_seq_fasta'], 'r') as f:
    lines = f.readlines()
adapters = []
for line in lines:
    adapters.append(line.rstrip('\n'))
print(adapters)
Alternatively, this FAQ post says that BBDuk can be used to remove all known Illumina adapter sequences, with the arg:
ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz
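That comma-separated ref list slots straight into the key=value kwarg formatting used by run_bbduk below:

```python
# BBDuk takes its options as key=value tokens, so the multi-file ref
# arg is just another kwargs entry
kwargs = {'ref': 'truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz'}
formatted = ['{}={}'.format(k, v) for k, v in kwargs.items()]
print(formatted)  # ['ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz']
```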
if o_s == 'Darwin':
    paths['bbmap_dir'] = '/Users/drew/Documents/Data/Python/bbmap'
if o_s == 'Windows':
    paths['bbmap_dir'] = (r'C:\Users\DMacKellar\Documents'
                          r'\Python\BioPython\BBMap')
paths['bbduk'] = os.path.join(paths['bbmap_dir'], 'bbduk.sh')
# Already in 'paths', but re-acquire to keep separate
untrimmed_reads = {}
for file in os.listdir(paths['split3']):
    file_under = str(file).replace('.', '_')
    path = os.path.join(paths['split3'], file)
    untrimmed_reads[file_under] = path
# Save a copy, since dir might change
paths['untrimmed_reads_list'] = os.path.join(paths['data_dir'],
                                             'untrimmed_reads_list.txt')
with open(paths['untrimmed_reads_list'], 'w') as f:
    for k, v in untrimmed_reads.items():
        f.write('{},{}\n'.format(k, v))
untrimmed_reads_dict = {}
t1, t2, t3 = [], [], []
with open(paths['untrimmed_reads_list'], 'r') as f:
    for line in f.readlines():
        new_line = line.rstrip('\n')
        for x in new_line.split(sep=','):
            t1.append(x)
t2 = t1[0::2]
t3 = t1[1::2]
for x, y in zip(t2, t3):
    untrimmed_reads_dict[x] = y
# untrimmed_reads_dict
Now, I'll try a few iterations of the BBDuk trimming program with various arguments set, trim a pared-down dataset, and visualize the output iteratively to look for improvement along the necessary dimensions. Once I have a workable approach, I'll extend it to the rest of the reads and visualize the entire output.
The thyroid reads look fairly representative; I'll start with them.
thyroid_reads = {}
for k, v in bmap_reads.items():
    if k[:7] == 'thyroid':
        thyroid_reads[k] = v
print(thyroid_reads.keys())
For reference, here's what they look like before modification:
thyroid_seq, thyroid_score = recover_seq_score(thyroid_reads)
plot_read_stats(thyroid_seq, thyroid_score)
plt.show()
Using subprocess, try a run of these through bbduk:
thyr = sra_table[sra_table.loc[:, 'organism_part'] == 'thyroid'].loc[:, 'Run'].tolist()
thyroid_files = []
for k, v in untrimmed_reads_dict.items():
    if k[:9] in thyr:
        thyroid_files.append(v)
thyroid_files
The docs for BBDuk say that it should be used with either paired reads or singlet reads per run, not both.
thyroid_paired = thyroid_files[:-1]
thyroid_single = thyroid_files[-1]
First, let's try adapter trimming.
Note: Ok, this worked straightforwardly on the Mac, but took forever to get working on the PC. On the PC, it kept executing without complaint, but writing no output FastQ files. Basically, as the authors note under the "Standard Syntax" heading in the [BBMap usage guide](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/usage-guide/), Windows doesn't natively recognize and run the Java shell scripts from the command line. Instead of just specifying the absolute path to the BBDuk shell script, you have to invoke something like:
java -cp /path/to/BBMap/current jgi.BBDukF
And then specify the remainder of the arguments. In other words, you call Java, specify a 'classpath' [notes here](https://javarevisited.blogspot.com/2011/01/how-classpath-work-in-java.html) with the absolute path to the 'current' dir within the BBMap distro, then call specific scripts by specifying `subdir.script` sort of format. Obviously, I'm not very used to running Java from the command line.
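Assembling that Windows-style invocation in Python looks like this (the BBMap dir matches the path used elsewhere in this notebook; the classpath layout is as described above):

```python
import os

bbmap_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\BBMap'
# Point java's classpath at BBMap's 'current' dir, then name the class
bbduk_cmd = 'java -cp {} jgi.BBDukF'.format(os.path.join(bbmap_dir, 'current'))
print(bbduk_cmd)
```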
import datetime
import subprocess

# Note: pass in fastq files as nested lists; pairs together
def run_bbduk(fastq_list, *args, **kwargs):
    # If a global 'trim_dirs' record exists, number this run's output
    # dir after the last one; otherwise start a fresh record
    try:
        trim_dirs
    except NameError:
        trim_dirs = {}
        new_dir = '1'
    else:
        new_dir = max([int(x) for x in trim_dirs.keys()]) + 1
    if o_s == 'Windows':
        bbduk = 'java -cp {} jgi.BBDukF'.format(
            os.path.join(paths['bbmap_dir'], 'current'))
    elif o_s == 'Darwin':
        bbduk = paths['bbduk']
    else:
        print('Error setting operating system')
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append('{}={}'.format(k, v))
    # If a global 'out_dir' exists, use it; otherwise make a
    # timestamped dir under the current working directory
    try:
        out_dir
    except NameError:
        ts = datetime.datetime.now().replace(microsecond=0).timestamp()
        read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
        out_dir = os.path.join(os.getcwd(), 'trim_{}'.format(read_ts))
        os.mkdir(out_dir)
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
    trim_dirs[str(new_dir)] = str(out_dir)
    os.chdir(out_dir)
    conditions = [*args, *kwargs_new]
    with open('trim_log.txt', 'w') as f:
        for condition in conditions:
            f.write('{}\n'.format(condition))
    print('Starting trimming run with args:\n{}\n{}\n...'.format(
        args, kwargs_new))
    for i, fq in enumerate(fastq_list):
        cmds = [bbduk, *args, *kwargs_new]
        fq_base = [os.path.splitext(os.path.basename(x))[0] for x in fq]
        out_files = ['{}_t.fastq'.format(os.path.join(out_dir, x))
                     for x in fq_base]
        if len(fq) > 1:
            cmds.append('in1={0} in2={1} out1={2} out2={3}'.format(
                fq[0], fq[1], out_files[0], out_files[1]))
        elif len(fq) == 1:
            cmds.append('in1={0} out1={1}'.format(fq[0], out_files[0]))
        else:
            print("I can't understand the input fastq_list; "
                  "did you pass a nested list?")
        final_cmds = ' '.join(cmd for cmd in cmds)
        p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, shell=True)
        out, err = p.communicate()
        if len(err) > 0:
            print('\n{:>4} {} returned an error: \n{}\n'.format(
                i + 1, fq, err.decode('ascii')))
        else:
            for f in fq:
                print('{:>4} {} trimmed'.format(i + 1, os.path.basename(f)))
    return trim_dirs
paths['bbduk'] = os.path.join(paths['bbmap_dir'], 'bbduk.sh')
paths['split3'] = os.path.join(paths['data_dir'], 'split3')
os.chdir(paths['split3'])
kwargs = {'ref': paths['illum_seq_fasta'],
'ktrim': 'l', 'k': 31, 'mink': 11,
'hdist': 1}
args = ['tpe tbo -Xmx4g']
# args = ['ref={} ktrim=l k=31 mink=11 hdist=1 tpe tbo -Xmx4g'.format(paths['illum_seq_fasta'])]
trim_dirs = run_bbduk([thyroid_paired], *args, **kwargs)
def fastqs_import(path=None):
    # Default to the current working directory if no path given
    path = path or os.getcwd()
    fastq_files = {}
    for file in os.listdir(path):
        filename, file_extension = os.path.splitext(file)
        if file_extension in ['.fastq', '.fq']:
            with open(os.path.join(path, file), 'r') as f:
                fastq_files[filename] = SeqIO.to_dict(SeqIO.parse(f, 'fastq'))
    return fastq_files
trimmed_reads = fastqs_import(os.getcwd())
trimmed_reads.keys()
# thyroid_seq_t, thyroid_score_t = recover_seq_score(trimmed_reads)
# plot_read_stats(thyroid_seq_t, thyroid_score_t)
# plt.show()
Hmm... That throws an error. Obviously, my custom-written code to return the quality metrics of reads hits hiccups when confronted with reads of variable length. I think it's time to go back to a purpose-built program to achieve this. I've used FastQC in the past, but was looking for some means of displaying any output within a Jupyter notebook, so I can keep the whole pipeline together in this one place. While searching for such an option, I came upon another program called MultiQC. An alternative is fadapa, essentially a wrapper API for Python to parse the output of FastQC.
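Before moving on entirely, it's worth sketching what a variable-length-tolerant version of the per-base stats might look like. This is my own guess at the fix, not code I actually ran: pad each read's quality scores out to the longest read with NaN, then aggregate column-wise with NaN-aware functions so short reads simply drop out of the later columns.

```python
import numpy as np

def per_base_mean_quality(quality_lists):
    """Column-wise mean quality over reads of unequal length."""
    longest = max(len(q) for q in quality_lists)
    padded = np.full((len(quality_lists), longest), np.nan)
    for i, q in enumerate(quality_lists):
        padded[i, :len(q)] = q
    # nanmean ignores the padding, so each position averages only
    # the reads that are actually that long
    return np.nanmean(padded, axis=0)

print(per_base_mean_quality([[30, 30, 20], [30, 10]]))  # [30. 20. 20.]
```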
def run_fastqc(trim_dir, *args, **kwargs):
    os.chdir(trim_dir)
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append('{}={}'.format(k, v))
    files = []
    for file in os.listdir(trim_dir):
        path = os.path.abspath(file)
        _, extension = os.path.splitext(path)
        if extension in ['.fq', '.fastq']:
            files.append(path)
    if o_s == 'Darwin':
        fastqc = paths['fastqc']
        multiqc = 'multiqc .'
    elif o_s == 'Windows':
        fastqc = ('java -cp {};{} '
                  'uk.ac.babraham.FastQC.FastQCApplication').format(
                      paths['bzip_dir'], paths['fastqc_dir'])
        multiqc = 'python {} .'.format(os.path.join(
            paths['multiqc_dir'], 'scripts', 'multiqc'))
    cmds = [fastqc, *args, *kwargs_new, *files]
    final_cmds = ' '.join(cmds)
    # print(final_cmds)
    p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    if p.returncode != 0:
        print('\nFastQC returned an error: \n{}\n'.format(
            err.decode('ascii')))
    else:
        print('FastQC run')
    p = subprocess.Popen(multiqc, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    if p.returncode != 0:
        print('\nMultiQC returned an error: \n{}\n'.format(
            err.decode('ascii')))
    else:
        print('MultiQC run')
    report = os.path.join(trim_dir, 'multiqc_report.html')
    webbrowser.open(report)
run_fastqc(trim_dirs['1'])
Ok, on 20180621, I tried running the above run_fastqc on the PC, and of course ran into the problem mentioned above, that on Windows it has to be called with the syntax of java -cp C:\somewhere\ uk.ac.FastQC. After doing that, though, it kept returning an error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/itadaki/bzip2/BZip2InputStream
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:152)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Caused by: java.lang.ClassNotFoundException: org.itadaki.bzip2.BZip2InputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
I fought it for a while, and figured that the core problem was the top line: the org/itadaki... entry was some missing dependency of FastQC. The lines below it are just a traceback of the more recent calls that tried to load this module. I asked around at PuPPy Python Programming night, and Pat concurred that this was probably the problem. Oftentimes, such a package path mirrors a real-life URL, i.e., I could try navigating in a browser to itadaki.org. That turned out not to be the case here, though; there was no such website.
Before I asked her, however, I had thought that maybe this was some sort of core dependency that might be part of the standard JDK or JRE distro, and that I just had to append my local classpath to the line calling the FastQC classpath (joined by semicolons). This didn't end up being the case; the code is instead more obscure, and had to be hunted down, but the point is that I tried ensuring that the classpath to the core Java modules on my PC was invoked. I couldn't easily find that path on my PC (searching 'Java' in the File Explorer returns way too many results), so I googled for more info, which suggested running the command for %i in (java.exe) do @echo. %~$PATH:i in the terminal; that did in fact return my local JDK instance. But then, trying to add this to the classpath caused more headaches, because it's in the Program Files dir, and when I try entering that as a path, Python double-slashes the space to escape it, so subprocess ends up complaining about not being able to find 'Files'... That may be specific to subprocess, because when I enter the absolute path without escaping the space, the os.listdir() command does find it.
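For the record, the spaces-in-paths problem can also be sidestepped in Python itself; a sketch (the JDK path below is hypothetical): handed to a shell as one unquoted string, the path splits at the space, but passing subprocess an argument list, or quoting the path inside the shell string, keeps it intact.

```python
# Hypothetical JDK path containing a space
java_exe = r'C:\Program Files\Java\jdk1.8.0_101\bin\java.exe'

# Naive shell string: whitespace-splitting breaks the path apart,
# so the shell goes hunting for something called 'Files...'
naive = '{} -version'.format(java_exe)
print(naive.split()[1])  # 'Files\...' -- no longer a valid path

# Fix 1: pass a list with shell=False; each element arrives as one argument
as_list = [java_exe, '-version']   # e.g. subprocess.Popen(as_list)

# Fix 2: if shell=True is required, quote the path explicitly
quoted = '"{}" -version'.format(java_exe)  # e.g. subprocess.Popen(quoted, shell=True)
print(quoted)
```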
I asked Pat about this, since she's running on PC, too. She told me about something called the Windows SysInternals suite; downloading that yields a bunch of useful tools, one of which is called junction. Changing to the dir it unzips to, then running junction C:\new_name "C:\old name" will create a stable soft linkage (stable in the sense that it will persist after rebooting the OS). So you just run junction C:\Programs "C:\Program Files", and now we can get to the JDK dir without any spaces.
Anyways, as I mentioned, Pat also helped me diagnose the initial problem with the missing org.itadaki code. Copy-pasting that entire path into Google brought up a GitHub account that appeared to have the proper code. I downloaded it and put the expanded repository in the Documents\Data\Python\BioPython dir on the PC. Then I was able to direct the run_fastqc function to the appropriate subdir holding the org dir. Pat said you can then delete all the rest of that repository. She also cautioned that it can be risky just downloading and running code from GitHub without having a good idea of what it's supposed to do, as it may contain malicious code.
I was alarmed by this, and pointed out that I hadn't considered GitHub to host anything other than safe, friendly code before, and asked the other attendees how prevalent a problem they think such an issue might be. They didn't know, but pointed out that it's a credible risk; there is plenty of discussion of this issue online. Pat said it was probably ok in this case, though, because the owner of the repository was Yahoo, which is generally considered safe.
In addition to questions of security, however, there are questions of propriety, such as the legitimacy of Yahoo hosting this code when they didn't appear to be the original owner. A more above-board way to acquire the code would be to not just take the top Google hit, but to try to find the definitive version; Pat said one way to hunt this down is to enter package:itadaki into the GitHub search. Another is to follow this advice, and more generally, here is a list of other advanced search keywords for use with GitHub. Of course, there's nothing strictly enforcing such coordination and compliance as would leave an easy-to-follow paper trail; any author could surreptitiously (or ignorantly, as might happen in my case) copy the code without forking it, then post or deploy it elsewhere. In this case, the Network tab doesn't appear to be available on Yahoo's bzip2 GitHub page. And searching package:itadaki on GitHub brings up >1K code results, the vast majority of which are in Java. None appear to have the username itadaki; I'd guess they're mostly academic users who use FastQC or something like it, hit issues with missing dependencies, and decided to post their own local copies. It could be that the original author of the bzip2 package even predates GitHub.
Anyways, copying the code without observing the proper license can get you into legal trouble; Pat said the two main licenses granted in the open-source world are the GNU General Public License (GPL) and the MIT license. She indicated that the former was more restrictive, and it is, but not as much as she seemed to indicate. (It is possible to offer source openly while still denying others the right to host/redistribute it without permission, but the GPL doesn't do that.) Rather, the most stringent aspect of the GPL is its status as copyleft, meaning code derived from it can't use different license terms than the original; it sounds like this was mostly meant to prevent other authors from trying to restrict or monetize code that was originally meant to be free. The MIT license, on the other hand, has virtually no restrictions, mostly because the author doesn't care to police the use of their creation. And the Yahoo version of the bzip2 package appears to use the MIT license.
All of this is to say that I'm learning more about both Java and Open Source software. In the latter case, the whole idea of forking code from Github, as opposed to merely downloading it, is that you may have the intention of modifying it; in which case, GitHub retains historical info about where the code came from, and who is working on it. This makes it easier to recombine your efforts afterwards, which is generally done via issuing a pull request, which the original author(s) can then decide whether to commit to the main branch (i.e., the most common/authoritative version of the package), or keep separate (maybe as a useful offshoot, designed to do something slightly different from the original package). All of these practices are common in maintaining and developing open-source code between multiple authors.
The bzip2 code had to be compiled before it could be called by run_fastqc; this is done by navigating to the root dir of the repository and running javac on the .java source files. Then, specifying the proper java -cp {};{} call in the run_fastqc function has it executing successfully on the PC.
But: multiqc still isn't. I got carried away with the Java stuff; that isn't the way to get multiqc to work from the command line on the PC. Rather, if I open C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc in NotePad++, it's clear that it's written in Python. But even running python C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc (with or without a terminating space and period, to indicate it should search the current dir) doesn't work immediately; it returns:
Traceback (most recent call last):
File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc", line 89, in <module>
callback = util_functions.view_all_tags,
AttributeError: module 'multiqc.utils.util_functions' has no attribute 'view_all_tags'
When I open the script in NotePad++ to troubleshoot it, it's clear that the whole thing, while it is Python, is composed in a complicated way that I don't immediately recognize. Specifically, the multiqc script isn't arranged as a class, but as a single function (with nested functions inside it), preceded by a whole lot of what appear to be decorators ('@click.option'), apparently to handle CLI-type options specified at runtime, to make it behave like a standard bash script. The offending line cited above (#89) is within one of these blocks, apparently meant to handle an input of '--view-tags', which I'd note I'm not even invoking here. Anyways, the other file it calls, util_functions, does indeed have a function called 'view_all_tags', but for some reason Python isn't handling it appropriately.
I tried making a copy of the multiqc script in the same dir called multiqc.py (i.e., I didn't change anything other than the extension); calling python multiqc.py now returns:
Traceback (most recent call last):
File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc.py", line 32, in <module>
from multiqc import __version__
File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc.py", line 32, in <module>
from multiqc import __version__
ImportError: cannot import name '__version__'
The subdir multiqc does indeed lack a file named __version__, but provision for supplying such a variable is offered within the file in that subdir, __init__.py.
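My suspicion about what's happening with the renamed copy: a script named multiqc.py sits first on sys.path, so 'import multiqc' resolves to the script itself, and 'from multiqc import __version__' re-imports the copy rather than the installed package (hence the doubled traceback line). A minimal, self-contained reproduction of the shadowing effect, using the stdlib json module as the victim:

```python
import os
import subprocess
import sys
import tempfile

# A file named after a package shadows that package: since the current
# dir is placed ahead of the stdlib on sys.path, 'import json' finds our
# empty json.py instead of the real module.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'json.py'), 'w') as f:
        f.write('')  # an empty impostor is enough
    out = subprocess.run(
        [sys.executable, '-c', 'import json; print(json.__file__)'],
        cwd=d, capture_output=True, text=True)
    print(out.stdout.strip())  # the impostor json.py, not the stdlib file
```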
I thought about going back to using Fadapa, or making my own parser function for FastQC, but it's too big a pain to be worth abandoning multiqc at this point. Eventually, I got so annoyed that I wrote an issue to the GitHub repo; the author Phil appears to be active recently in addressing them. I'll see what happens.
Ok, on 20180623, I saw that Phil Ewell had gotten back to me overnight; he said the most likely problem was mismatched Python versions used to install vs. run multiqc, which isn't the problem here. He said to try Miniconda, by which I think he meant to use the conda package manager to download and install multiqc, but the bioconda repo for multiqc lacks a Windows version. He also pointed me towards a test script which ran successfully on AppVeyor, a continuous-integration service that spins up Windows and Linux instances for testing packages.
I tried stepping through the individual install steps contained in that script (I haven't figured out how to adapt it to run as a Powershell script on Windows), but the archive it downloads appears to lack the multiqc script itself. I reported all of this on Github, and we'll see if he comes back with any other ideas.
As an alternative, I might be able to modify his test script to run on Windows with Powershell. Some example sites giving tutorials for powershell are available.
Phil responded to my response, but without much additional detail. He says that bioconda has no support for Windows (apparently, for any of its packages), so I responded with confusion as to why he would suggest that I use Miniconda to install multiqc.
Further, he says that the multiqc script isn't in the downloaded test_data zip file, but that the original repo is also downloaded by the test script. I don't see any line that would accomplish that, however; rather, I figured out that, since the appveyor.yml file is present in the main repo's code itself, he actually meant that he downloads the main repo to the Appveyor instance he's running, then invokes the test script from within it. That's why the test script contains (line #30) a call to python setup.py install without any context as to where that file would've come from; it's being run in the root dir of the entire main MultiQC repo. Anyways, all of this means that the code that was run successfully in his script is not different from that which I've tried and which fails to run.
I eventually went back to the line #89 that Python complains about within the multiqc script, however, and looked more closely at the context. I noticed that many of the other tags above and below the @click.option( '--view-tags'... block have some of the same args passed to them, but not the 'callback' flag being set in the offending line. In fact, searching for that in the rest of the script returns no hits; this is the only instance of a 'callback' flag being set in the main MultiQC script. So I figured commenting it out might help, and it does; the script will now run if called directly in Windows as 'python \path_to\multiqc_dir\scripts\multiqc \path_to\data_dir\reads_with_fastqc_reports -o \path_to\data_dir\subdir_to_write_to'. I guessed that something was going wrong with the way he was using the decorator to make an attribute of the view_all_tags function within the util_functions file. In my response I passed along a link to a blog post that might be helpful; apparently there are several pitfalls to using decorators in Python when anything changes. In any case, I'm now free to use MultiQC to summarize the FastQC reports on my PC, and hopefully provide a friendlier format for displaying them within this notebook.
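The classic decorator pitfall of the kind that blog post covers (a generic illustration with made-up function names, not MultiQC's actual code) is a decorator that replaces a function without copying over its attributes; functools.wraps exists precisely to fix that:

```python
import functools

def logged(func):
    # No functools.wraps: the wrapper hides the original's metadata
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def logged_properly(func):
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@logged
def view_all_tags():
    """List every available tag."""

@logged_properly
def view_all_tags2():
    """List every available tag."""

print(view_all_tags.__name__)   # 'wrapper' -- the original name is lost
print(view_all_tags2.__name__)  # 'view_all_tags2'
```

Anything that later looks up the decorated function by name or attribute sees the anonymous wrapper in the first case, which is how seemingly unrelated AttributeErrors can appear.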
# from IPython.display import HTML
# multiqc_report = r'C:/Users/DMacKellar/Documents/Data/Bio/Bmap/split3/trim_2018-06-25T18_18_00/multiqc_report.html'
# HTML(multiqc_report)
Ok, clearly I'm gonna want to find a more compact or flexible way to represent the output of MultiQC than just to paste the entire report into this notebook. Presumably, this is configurable, but for now I'm more interested in improving the trimming; this output is still poor.
There's one other issue: opening the MultiQC report in the same tab as this notebook ends up screwing up the formatting; Ewell made the MultiQC report HTML pretty fancy, and it contains a lot of subtle touches that override some of the core context that Jupyter expects. To keep this notebook viable, therefore, I'll either have to find a way to make the report's HTML formatting flatter or, more simply, open it in another tab.
I'll go with the latter for now. This post shows some simple code that appears to work for me. This response shows another, more complicated way of doing it that might be more flexible in the future. I'll just try adding the former code to the run_fastqc function above, so that it automatically opens the new report in a separate tab.
Now, to re-run bbduk with different trimming options. The first file actually has pretty good quality per base scores out to about 30bp; but the second file (ERR030872_2_t) has mean scores below 20 up until base 11. Further, the mean quality-per-read scores are unacceptably low for a large number of reads in both files. This might not be too surprising, since the instructions I used for the first run were taken from a section entitled 'adapter trimming', and it didn't do much. I think I'll need to perform different iterations in series, starting with getting a better trimming run.
It appears that it's not recognizing the adapters in the reference file. The bbduk docs say that they include a more comprehensive list of Illumina adapters in their distro under /bbmap/resources/adapters.fa. I'll point to that and try again.
After trying a few different configurations, it looks like a good mix for adapter trimming is to reduce the value of the integers passed to the k and mink parameters. The bbduk docs had said that the value of k should 'be at most the length of the adapters', which I took to mean larger was better, but it's the kmer used to search the adapter sequences against your seq, so specifying a very long match might be too stringent. Values of k and mink around 12 and 3, respectively, seem to reduce most of the overrepresentation of ACTG in the first dozen bases of both reads.
Using the adapters file provided with bbmap causes more reads to be removed than with the custom file I made (with k=12 & mink=3, the % of bases removed is 21% as opposed to 7.5%, respectively). Alternatively, increasing the hamming distance from 1 to 2 improves 5' variability, trimming ~16% of base info.
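A toy illustration of why smaller k and mink are less stringent (this is not bbduk's actual matching algorithm, just set-based k-mer sharing against a made-up read): a short adapter remnant at the start of a read contributes no shared 31-mers at all, only a couple of shared 12-mers, and many shared 5-mers.

```python
def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

adapter = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'  # a common Illumina adapter
read_start = 'AGATCGGAAGAGC' + 'TTGCAATTCGA'   # 13bp remnant + made-up insert

for k in (31, 12, 5):
    shared = kmers(adapter, k) & kmers(read_start, k)
    print('k={:>2}: {} shared k-mers'.format(k, len(shared)))
```

With k=31 the remnant can never be detected (the read has no 31-mers spanning it), while small k values light up even partial or error-containing remnants, at the cost of more chance matches.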
paths['bbmap_adapt'] = os.path.join(*[paths['bbmap_dir'], 'resources', 'adapters.fa'])
os.chdir(paths['split3'])
kwargs1 = {'ref': paths['bbmap_adapt'],
           # 'ref': paths['illum_seq_fasta'],
           'ktrim': 'l', 'k': 15, 'mink': 3,
           'hdist': 1}
kwargs2 = {'ref': paths['illum_seq_fasta'],
           'ktrim': 'l', 'k': 12, 'mink': 3,
           'hdist': 2}
args = ['tbo -Xmx4g']
# run_bbduk([thyroid_paired], *args, **kwargs1)
trim_dirs = run_bbduk([thyroid_paired], *args, **kwargs2)
run_fastqc(trim_dirs['1'])
Next (after adapter trimming), let's try quality trimming. Actually, after reviewing the bbduk docs (also elaborated upon here), I see that the author(s) actually recommend a specific, explicit workflow:
And additional steps that I don't quite understand, but sound like they're of diminishing importance. I'll try running the 7 steps above in succession using the default args and kwargs listed in the docs, and check the result.
Ok, after having run the entire pipeline with the default params, I see that it's too aggressive. It reduces the input 10,000 reads down to about 5,500, most of which (~3,000) are 40bp for the forward file, and 35bp for the reverse file, but each file also has significant size spikes at every multiple of 5bp in length (30, 25, etc.). Per sequence quality scores and mean scores per base are both high; the data that remain would perform very well, I think. But I'd like to reduce the amount of loss, I think some of the reads/bases being discarded could have value.
According to the printed output below, it seems that the biggest losses are occurring at the steps of Quality trimming (ditches 32% of reads and 45% of the bases) and Force trimming (ditches 17% of reads and 31% of bases). Losses at other steps are marginal. Actually, the Quality trimming loss was partially because I hadn't used the default quality cutoff of 10 from the right side, but 15 from either side. Also, with adapter trimming at the default values it's not really catching anything, which could be part of the problem with quality at later steps. I'll return that step to the params above.
Actually, it looks like the first few bases are still too problematic, not in terms of quality scores necessarily, but in terms of variability in distribution of ATCG values. It's a problem for the first 10 in the first file, but only the first 5 in the second. I think I'll return the force trim left to 10, and add a length filtering step of 20 at the penultimate step.
This approach keeps only ~4800 reads per file, with lengths between 20 and 40bp (mostly on the higher side), but the output quality scores are excellent. I may provide another variant of this workflow that's less aggressive, but this does look kind of like the best way to recover usable data from the thyroid reads. I'll wrap this workflow into a function and scale up with more reads.
# Note: pass in fastq files as a list of lists; pairs together
def run_bbduk_pipeline(fastq_list, trim_dirs=None, use_defaults=True,
                       args_dict=None, kwargs_dict=None,
                       keep_intermediate_files=False):
    def return_default_dicts():
        default_kwargs = {'adapter_trimming': {'ref': paths['illum_seq_fasta'],
                                               'ktrim': 'l', 'k': '15',
                                               'mink': '4', 'hdist': '1'},
                          'quality_trimming': {'qtrim': 'rl', 'trimq': '10'},
                          'force_trimming': {'ftl': '10', 'ftr': '139'},
                          'force-trim_modulo': {'ftm': '5'},
                          'quality_filtering': {'maq': '10', 'minlen': '20'},
                          'kmer_filtering': {'ref': paths['phi_x'], 'k': '31',
                                             'hdist': '1'}}
        default_args = {k: [] for k in default_kwargs}
        return default_args, default_kwargs
    if trim_dirs is None:
        trim_dirs = {}
        new_dir = '1'
    else:
        new_dir = max([int(x) for x in trim_dirs.keys()]) + 1
    if o_s == 'Windows':
        bbduk = 'java -cp {}\current jgi.BBDukF'.format(paths['bbmap_dir'])
    elif o_s == 'Darwin':
        bbduk = paths['bbduk']
    else:
        print('Error setting operating system')
    try:
        out_dir
    except NameError:
        ts = datetime.datetime.now().replace(microsecond=0).timestamp()
        read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
        out_dir = os.path.join(os.getcwd(), 'trim_{}'.format(read_ts))
        os.mkdir(out_dir)
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
    trim_dirs[str(new_dir)] = str(out_dir)
    os.chdir(out_dir)
    if use_defaults:
        args, kwargs = return_default_dicts()
    if args_dict:
        for k, v in args_dict.items():
            for k2, v2 in v.items():
                args[k][k2] = v2
    if kwargs_dict:
        for k, v in kwargs_dict.items():
            for k2, v2 in v.items():
                kwargs[k][k2] = v2
    for k, v in kwargs.items():
        for k2, v2 in v.items():
            args[k].append('{}={}'.format(k2, v2))
    with open('trim_log.txt', 'w') as f:
        for k, v in args.items():
            f.write('{}: {}\n'.format(k, ' '.join(v)))
    print('Starting trimming run with args:\n')
    for k, v in args.items():
        print('{}: {}'.format(k, ' '.join(v)))
    print('\n\nRead files processed:')
    def get_next_step(arg_list):
        order_of_steps = {'format_conversion': 0, 'adapter_trimming': 1,
                          'contaminant_filtering': 2, 'nextera_lmp_lib_split': 3,
                          'human_contam_removal': 4, 'quality_trimming': 5,
                          'force_trimming': 6, 'force-trim_modulo': 7,
                          'quality_filtering': 8, 'kmer_filtering': 9,
                          'kmer_masking': 10, 'quality_recalib': 11,
                          'deduplication': 12, 'normalization': 13}
        sorted_steps = sorted(arg_list, key=lambda x: order_of_steps[x])
        for step in sorted_steps:
            yield step
    for i, fq in enumerate(fastq_list):
        steps_run = []
        steps = get_next_step(args)
        if len(fq) > 1:
            fq_base = [os.path.splitext(os.path.basename(x))[0] for x in fq]
        else:
            fq_base = [os.path.splitext(os.path.basename(*fq))[0]]
        filepaths = [os.path.join(out_dir, x) for x in fq_base]
        for step in steps:
            if len(steps_run) == 0:
                in_suff = ''
                out_suff = '_{}'.format(step)
                for f1, f2 in zip(fq, fq_base):
                    shutil.copy2(f1, '{}.fastq'.format(os.path.join(out_dir, f2)))
            elif len(steps_run) == len(args) - 1:
                in_suff = '_{}'.format(steps_run[-1])
                out_suff = '_t'
            else:
                in_suff = '_{}'.format(steps_run[-1])
                out_suff = '_{}'.format(step)
            infiles = ['{}{}'.format(x, in_suff) for x in filepaths]
            outfiles = ['{}{}'.format(x, out_suff) for x in filepaths]
            if len(fq) == 2:
                files = 'in1={0}.fastq in2={1}.fastq out1={2}.fastq out2={3}.fastq'.format(
                    infiles[0], infiles[1], outfiles[0], outfiles[1])
            elif len(fq) == 1:
                files = 'in={0}.fastq out1={1}.fastq'.format(infiles[0], outfiles[0])
            else:
                print("I can't understand the input fastq_list; did you pass a nested list?")
            cmds = [bbduk, files, *args[step]]
            final_cmds = ' '.join(cmds)
            p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE, shell=True)
            out, err = p.communicate()
            if p.returncode == 0:
                steps_run.append(step)
            else:
                print('{} file: {} step returned error {}'.format(
                    fq[0], step, err.decode('ascii')))
                break
        print(fq_base)
        if not keep_intermediate_files:
            for f in filepaths:
                os.remove('{}.fastq'.format(f))
                for step in steps_run[:-1]:
                    os.remove('{}_{}.fastq'.format(f, step))
    return trim_dirs
os.chdir(paths['split3'])
trim_dirs = run_bbduk_pipeline([thyroid_paired], trim_dirs=trim_dirs)
latest_dir = trim_dirs[str(max(list(int(x) for x in trim_dirs.keys())))]
run_fastqc(latest_dir)
Ok, that seems like an extremely complicated function to run the bbduk steps in a modifiable way, but I've now got a pipeline that should work. Let's apply it to all of the BodyMap read files.
def get_fastqs(input_dir):
    singlets = []
    pairs = []
    for file in os.listdir(input_dir):
        base = os.path.splitext(file)[0]
        ext = os.path.splitext(file)[-1]
        if ext in ['.fq', '.fastq']:
            trim_index = 0
            if base[-2:] == '_t':
                trim_index += -2
            if base[-2 + trim_index:0 + trim_index] in ['_1', '_2']:
                pairs.append(file)
            else:
                print(trim_index)
                print(base[-2 + trim_index:0 + trim_index])
                print(base[-2:])
                filepath = os.path.join(input_dir, file)
                singlets.append([filepath])
    pairs2 = sorted(pairs)
    paired = []
    print(pairs)
    for x, y in zip(pairs2[0::2], pairs2[1::2]):
        base_x = os.path.splitext(x)[0][:-2 + trim_index]
        base_y = os.path.splitext(y)[0][:-2 + trim_index]
        print(base_x, base_y)
        if base_x == base_y:
            filepath_x = os.path.join(input_dir, x)
            filepath_y = os.path.join(input_dir, y)
            paired.append([filepath_x, filepath_y])
    return singlets + paired
def get_fastqs(input_dir):
    singlets = []
    pairs = []
    def strip_t(name):
        # Remove a literal '_t' suffix; str.rstrip('_t') would also eat
        # any trailing 't' or '_' characters that belong to the name itself
        return name[:-2] if name.endswith('_t') else name
    for file in os.listdir(input_dir):
        base = strip_t(os.path.splitext(file)[0])
        ext = os.path.splitext(file)[1]
        if ext in ['.fq', '.fastq']:
            if base[-2:] in ['_1', '_2']:
                pairs.append(file)
            else:
                filepath = os.path.join(input_dir, file)
                singlets.append([filepath])
    pairs2 = sorted(pairs)
    paired = []
    for x, y in zip(pairs2[0::2], pairs2[1::2]):
        base_x = strip_t(os.path.splitext(x)[0])[:-2]
        base_y = strip_t(os.path.splitext(y)[0])[:-2]
        if base_x == base_y:
            filepath_x = os.path.join(input_dir, x)
            filepath_y = os.path.join(input_dir, y)
            paired.append([filepath_x, filepath_y])
    return singlets + paired
os.chdir(paths['split3'])
bmap_inputs = get_fastqs(paths['split3'])
trim_dirs = run_bbduk_pipeline(bmap_inputs, trim_dirs=trim_dirs)
latest_dir = trim_dirs[str(max(list(int(x) for x in trim_dirs.keys())))]
run_fastqc(latest_dir)
Ok, that looks good. After lots of wrangling, I'd say I'm ready to align.
The classic approach is to use TopHat and Cufflinks. It sounds like some more recent choices are STAR and BBMap. Since I've already used BBDuk, I'll try the latter first.
One major question is whether to align against the human genome or some previously-compiled transcriptome. It sounds like the former approach will catch more of the reads and have lower duplication rates (i.e., fewer places where a read may map multiple times). It looks like, as of 20180629, the latest stable NCBI RefSeq build for the human genome is GRCh38.p12. More generally, the definitive resource for maintaining and iteratively improving human genome assembly and annotation data is the Genome Reference Consortium, which is what the 'GRC' prefix stands for. That consortium currently only maintains the standard genomes for 4 organisms, so the first letter after GRC indicates the organism: human ('h'), zebrafish ('z'), mouse ('m'), and chicken ('g' for 'Gallus'). '38' means that this is the 38th such release, and the '.p12' suffix means that (as of Nov. 2017) this release has been 'patched' a dozen times, representing more minor revisions to the sequence since that version was released.
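That naming scheme is regular enough to parse mechanically; a small helper of my own (not any NCBI API) just to make the convention concrete:

```python
import re

def parse_grc_name(name):
    """Split a GRC assembly name like 'GRCh38.p12' into its parts."""
    organisms = {'h': 'human', 'z': 'zebrafish', 'm': 'mouse', 'g': 'chicken'}
    m = re.fullmatch(r'GRC([hzmg])(\d+)(?:\.p(\d+))?', name)
    if not m:
        raise ValueError('not a GRC assembly name: {}'.format(name))
    org, release, patch = m.groups()
    return {'organism': organisms[org],
            'release': int(release),
            'patch': int(patch) if patch else 0}

print(parse_grc_name('GRCh38.p12'))
# {'organism': 'human', 'release': 38, 'patch': 12}
```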
Each such release contains multiple datasets: transcriptomes and genomes, including sequence from the Human Genome Project that has successfully been assembled into super-assemblies representing whole chromosomes, as well as a genome that includes additional large scaffolds that haven't been assigned to a particular chromosome. The summary data table shows that there are two main kinds of assembly within: the more succinct primary assembly (chromosomes plus unplaced scaffolds) and the more complete full assembly (which adds alternate loci and patch scaffolds).
The records are hosted in the GenBank database, and are therefore outside the reach of the sra_download function that I defined earlier. Instead, you could use NCBI's Batch Entrez website, but the programmatic solution is probably to use Biopython's Entrez.efetch. Note that this generally requires an API key affiliated with an NCBI account. Furthermore, while I had thought it would be best to use the full-featured GenBank files for the reference genome (to retain annotation info), the bbmap docs state that they only accept fasta or fastq as inputs.
Finally, when it comes to setting up a BBMap run, the seqanswers thread has some useful advice.
paths['grch38'] = os.path.join(paths['data_dir'], 'GRCh38_p12_table.txt')
with open(paths['grch38'], 'r') as f:
    grch38 = pd.read_csv(f, header=62, sep='\t', encoding='utf-8')
print(grch38.shape)
grch38
import json
paths['credentials_json'] = os.path.join(paths['data_dir'], 'credentials.json')
with open(paths['credentials_json'], 'r') as f:
    credentials = json.load(f)
human_chr = grch38['RefSeq-Accn'].loc[:23].tolist()
human_chr.append(grch38['RefSeq-Accn'].loc[593])
human_all = grch38['RefSeq-Accn'].tolist()
# print(sorted(grch38.columns))
# human_chr
from Bio import Entrez
paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')
os.chdir(paths['grch38_dir'])
Entrez.email = credentials['ncbi']['email']
ncbi_api = credentials['ncbi']['api_key']
# sra_download(human_all)
def download_genbank(acc_list, email=Entrez.email, api_key=ncbi_api,
                     db='nuccore', rettype='fasta',
                     retmode='text', file_ext='.fasta'):
    for acc in acc_list:
        out_name = '{}{}'.format(acc, file_ext)
        # Check for the file we actually write (accession + extension),
        # not the bare accession, so re-runs skip completed downloads
        if not os.path.isfile(out_name):
            net_handle = Entrez.efetch(db=db, id=acc, rettype=rettype,
                                       retmode=retmode)
            with open(out_name, 'w') as out_handle:
                out_handle.write(net_handle.read())
            net_handle.close()
            print('{} downloaded'.format(acc))
# # Note: cell commented out after running once;
# # don't want to download all those data again
# download_genbank(human_chr)
Now, I'll need to concatenate these into one large Fasta file, since I believe bbmap requires a single file as reference for mapping, not a dir.
# %%time
# # Note: Comment out this cell after running once
# paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')
# filenames = []
# for f in os.listdir(paths['grch38_dir']):
#     filenames.append(os.path.join(paths['grch38_dir'], f))
# paths['concat'] = os.path.join(paths['grch38_dir'], 'hs.fasta')
# with open(paths['concat'], 'w') as outfile:
#     for fname in filenames:
#         with open(fname) as infile:
#             for line in infile:
#                 outfile.write(line)
#             outfile.write('\n\n')
Note that the resulting file is very large: around 3.5GB. That's large compared to the genome's underlying information content: plain-text FASTA spends a full byte on each of the roughly 3 billion bases, even though a 4-letter alphabet needs only 2 bits per base. Anyways, let's try aligning!
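The arithmetic behind that comparison, assuming roughly 3.1 billion assembled bases (the exact total varies by assembly):

```python
# Back-of-envelope comparison of on-disk vs information-theoretic size
bases = 3.1e9               # ~3.1 billion bp (assumed round number)
fasta_bytes = bases * 1     # FASTA stores one ASCII byte per base
info_bytes = bases * 2 / 8  # 4 symbols -> 2 bits per base if bit-packed
print('{:.1f} GB as FASTA vs {:.2f} GB bit-packed'.format(
    fasta_bytes / 1e9, info_bytes / 1e9))
```

The observed ~3.5GB is in line with one byte per base plus FASTA headers and newlines.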
bbmap_path = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
# os.path.join avoids embedding '\current' (a stray escape) in a non-raw string
bbmap_path1 = 'java -cp {} align2.BBMap'.format(os.path.join(paths['bbmap_dir'], 'current'))
out_path = os.path.join(*[paths['split3'], 'bbmappings', 'manual'])
in_path = os.path.join(trim_dirs['3'], 'ERR030856_t.fastq')
ref_path = os.path.join(paths['grch38_dir'], 'hs.fasta')
ref_path1 = os.path.join(paths['grch38_dir'], 'NC_000024.10.fasta')
cmd = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, in_path, out_path, ref_path1)
print(cmd)
cmd1 = '{0} -Xmx7G ref={1} usemodulo'.format(bbmap_path1, ref_path)
print(cmd1)
sra_table.head()
sra_table.loc[sra_table['Run'] == 'ERR030856', 'Sex']
sra_table[sra_table['Run'] == 'ERR030856']
# sra_table.columns
Ok, what I'm learning is that my laptop lacks the RAM to build the index file for the whole human genome. Instead, I'll have to run a scaled-down version of the mapping, using single chromosomes. For instance, the index built fine with just NC_000024.10 (the Y chromosome). If desirable, it might be fine to loop through the different chromosome reference files for each read, but given this is mostly a proof-of-concept notebook, I think I'll be ok with just running the reads against a single chromosome.
Actually, the mapping is relatively fast compared to the indexing, at least when dealing with a few thousand reads, so in that case the more time-efficient approach would be to loop through the chromosomes first and, for each chromosome, align all of the reads to it.
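That loop order, sketched out (map_all and align are hypothetical placeholders for the BBMap invocations used elsewhere in this notebook):

```python
# Build each chromosome index once (expensive), then run every read set
# against it (cheap at a few thousand reads) before moving on.
def map_all(chromosome_refs, read_files, align):
    results = {}
    for ref in chromosome_refs:       # outer loop: one index build per ref
        for reads in read_files:      # inner loop: reuse that index
            results[(ref, reads)] = align(reads, ref)
    return results

demo = map_all(['chrY'], ['ERR030856'], lambda r, c: 'mapped {} to {}'.format(r, c))
print(demo)  # → {('chrY', 'ERR030856'): 'mapped ERR030856 to chrY'}
```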
I was stunned at first to run an example alignment of ERR030856 against a reference chromosome, and get the following stats out:
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 200000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 6508 (486300 bases)
Mapping: 4.974 seconds.
Reads/sec: 1308.52
kBases/sec: 97.78
Read 1 data: pct reads num reads pct bases num bases
mapped: 4.1026% 267 3.2387% 15750
unambiguous: 1.7210% 112 1.7674% 8595
ambiguous: 2.3817% 155 1.4713% 7155
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 0.2151% 14 0.1141% 555
semiperfect site: 0.2151% 14 0.1141% 555
Match Rate: NA NA 82.6490% 14009
Error Rate: 20.2238% 253 17.3451% 2940
Sub Rate: 19.9840% 250 9.9941% 1694
Del Rate: 1.5987% 20 7.0796% 1200
Ins Rate: 1.5987% 20 0.2714% 46
N Rate: 0.0799% 1 0.0059% 1
Splice Rate: 0.3197% 4 (splices at least 20 bp)
Total time: 10.177 seconds.
4% of reads mapping successfully? But then I realized that, of course, you wouldn't expect that many reads to align to a single one of the 24 chromosome sequences. Checking the sra_table, I find that those reads came from one of the '16-tissue mixture' runs.
That suggests a good test of accuracy: check how the ovary and testis reads map to the Y chromosome (and maybe one more tissue for comparison, like thyroid).
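Looking up the run accessions for those tissues amounts to a simple filter on the run table; a sketch on a stand-in DataFrame (only the 'Run' and 'organism_part' column names are carried over from sra_table; the rows here are illustrative):

```python
import pandas as pd

# Stand-in for a few rows of sra_table
runs = pd.DataFrame({
    'Run': ['ERR030901', 'ERR030902', 'ERR030903'],
    'organism_part': ['ovary', 'testis', 'thyroid'],
})
# Select the runs whose tissue is in the comparison set
targets = runs[runs['organism_part'].isin(['ovary', 'testis', 'thyroid'])]
print(targets['Run'].tolist())  # → ['ERR030901', 'ERR030902', 'ERR030903']
```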
bbmap_path = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
bbmap_path1 = 'java -cp {} align2.BBMap'.format(os.path.join(paths['bbmap_dir'], 'current'))
out_path = os.path.join(*[paths['split3'], 'bbmappings', 'manual'])
thyroid_path = os.path.join(trim_dirs['3'], 'ERR030903_t.fastq')
testes_path = os.path.join(trim_dirs['3'], 'ERR030902_t.fastq')
ovaries_path = os.path.join(trim_dirs['3'], 'ERR030901_t.fastq')
ref_path = os.path.join(paths['grch38_dir'], 'hs.fasta')
ref_path1 = os.path.join(paths['grch38_dir'], 'NC_000024.10.fasta')
cmd1 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, thyroid_path, out_path, ref_path1)
cmd2 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, testes_path, out_path, ref_path1)
cmd3 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, ovaries_path, out_path, ref_path1)
print('{}\n{}\n{}'.format(cmd1, cmd2, cmd3))
# cmd1 = '{0} -Xmx4G ref={1}'.format(bbmap_path1, ref_path1)
# print(cmd1)
For Thyroid, that gives:
Read 1 data: pct reads num reads pct bases num bases
mapped: 7.1513% 362 5.2927% 12225
For Testes, it's:
Read 1 data: pct reads num reads pct bases num bases
mapped: 5.4195% 290 4.7333% 12620
And for Ovaries:
Read 1 data: pct reads num reads pct bases num bases
mapped: 8.0251% 384 6.1409% 13495
Well, that's surprising, and potentially problematic. I'll have to check some references and see if there's any reason to expect ovary tissue RNA to map to the Y chromosome; the pseudoautosomal regions shared with the X, and X-Y homologous genes, seem like obvious candidates.
Actually, another problem with mapping to chromosomes individually: reads might align to multiple sites, and that would be hard to catch. When using the entire genome as the index, you'd at least be aware of which reads do so, and could rank the different sites as better or worse matches. If you map the reads to individual chromosomes serially, it's hard to track which reads have already mapped elsewhere with a better score.
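One way to track multi-mappers across serial per-chromosome runs would be to retain, per read, only the best-scoring hit seen so far; a minimal sketch (the read ids and score values are invented):

```python
# Merge per-chromosome hits, keeping the best alignment score per read.
# Hit tuples are (read_id, chromosome, score); higher score = better.
hits = [
    ('read1', 'chr20', 37), ('read1', 'chrY', 12),
    ('read2', 'chrY', 44),
]
best = {}
for read_id, chrom, score in hits:
    if read_id not in best or score > best[read_id][1]:
        best[read_id] = (chrom, score)
print(best)  # → {'read1': ('chr20', 37), 'read2': ('chrY', 44)}
```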
So probably the best approach to do here would be just to choose a single chromosome, map all the reads to that locally (to do some proof-of-principle in this notebook), then do a more comprehensive run on Galaxy, or some other server instance where I'll have more resources. Let's just do Chr1.
...Actually, even Chr1 runs out of memory with -Xmx7G. Let's try Chr20.
grch38
import datetime
import subprocess

def run_bbmap(reads_dir, ref_file_path, map_dirs=None, use_defaults=True, get_stats=True, args_list=None, kwargs_dict=None):
# (removed an unused first pass over the read files here; the in/out strings are built in the main loop below)
if map_dirs is None:
map_dirs = {}
new_dir = '1'
else:
# this previously read trim_dirs by mistake; the numbering should continue from map_dirs itself
new_dir = max([int(x) for x in map_dirs.keys()]) + 1
try:
out_dir
except NameError:
ts = datetime.datetime.now().replace(microsecond=0).timestamp()
read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
out_dir = os.path.join(os.getcwd(), 'map_{}'.format(read_ts))
os.mkdir(out_dir)
else:
if not os.path.isdir(out_dir):
os.mkdir(out_dir)
map_dirs[str(new_dir)] = str(out_dir)
os.chdir(out_dir)
if o_s == 'Windows':
bbmap = 'java -cp {} align2.BBMap'.format(os.path.join(paths['bbmap_dir'], 'current'))
elif o_s == 'Darwin':
bbmap = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
def return_defaults():
default_kwargs = {'maxindel': '200k', 'ambig': 'random', 'intronlen': '20', 'xstag': 'us'}
default_args = ['-Xmx7G']
return default_args, default_kwargs
reads = get_fastqs(reads_dir)
ref_base = os.path.splitext(os.path.basename(ref_file_path))[0].replace('.', '_')
print('Aligning to {}'.format(ref_base))
for read in reads:
if use_defaults==True:
args, kwargs = return_defaults()
else:
args, kwargs = [], {}
if args_list:
# imperfect, but try to replace any arg matching first 3 chars
for a1 in args_list:
for a2 in args:
if a1[:3] == a2[:3]:
args.remove(a2)
args.append(a1)
if kwargs_dict:
for k, v in kwargs_dict.items():
kwargs[k] = v
args.append('ref={}'.format(ref_file_path))
def strip_t(p):
# strip a literal '_t' suffix; str.rstrip('_t') would remove any run of trailing '_' and 't' characters, not just the suffix
base = os.path.splitext(os.path.basename(p))[0]
return base[:-2] if base.endswith('_t') else base
if len(read) == 2:
base_name = [strip_t(x) for x in read]
files = 'in1={0} in2={1} out={2}_{3}_{4}.sam'.format(read[0], read[1], ref_base, base_name[0], base_name[1])
elif len(read) == 1:
base_name = [strip_t(read[0])]
files = 'in={0} out={1}_{2}.sam'.format(read[0], ref_base, base_name[0])
else:
print('I can\'t understand the input fastq_list; did you pass a nested list?')
if get_stats==True:
for stat in ['covstats', 'covhist', 'bincov']:
kwargs['{}'.format(stat)] = '{}_{}_{}.txt'.format(stat, ref_base, base_name[0])
for k, v in kwargs.items():
args.append('{}={}'.format(k, v))
cmds = [bbmap, files, *args]
final_cmds = ' '.join(cmd for cmd in cmds)
p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE,
stderr=subprocess.PIPE, shell=True)
out, err = p.communicate()
if p.returncode == 0:
print('{} mapped'.format(read))
else:
print('{} file returned error: {}'.format(read, err.decode('ascii')))
break
return map_dirs
%%time
os.chdir(paths['split3'])
map_dirs = run_bbmap(latest_dir, paths['NC_000020_11_fasta'])
Ok, that takes a while to run, but does complete successfully. I'll try MultiQC on the resulting dir to see whether it can wrap any of the stats files output by bbmap.
def run_multiqc(in_dir):
if o_s == 'Darwin':
multiqc = 'multiqc .'
if o_s == 'Windows':
multiqc = 'python {} .'.format(os.path.join(*[paths['multiqc_dir'], 'scripts', 'multiqc']))
os.chdir(in_dir)
p = subprocess.Popen(multiqc, stdout=subprocess.PIPE,
stderr=subprocess.PIPE, shell=True)
out, err = p.communicate()
if p.returncode != 0:
print('\n returned an error: \n{}\n'.format(
err.decode('ascii')))
else:
print('MultiQC run')
report = os.path.join(in_dir, 'multiqc_report.html')
webbrowser.open(report)
paths['map_dir1'] = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53'
run_multiqc(paths['map_dir1'])
Ok, MultiQC does recognize data in that dir, but the only report it summarizes is the coverage histogram. That's a relatively succinct output; for covstats_NC_000020_11_ERR030856.txt it reads:
paths['example_covstats'] = os.path.join(
paths['map_dir1'], 'covstats_NC_000020_11_ERR030856.txt'
)
with open(paths['example_covstats'], 'r') as f:
ex_covstats_df = pd.read_csv(f, sep='\t')
ex_covstats_table = sra_table[sra_table['Run'] == 'ERR030856']
ex_covstats_tiss = ex_covstats_table['organism_part']
ex_covstats_len = ex_covstats_table['AvgSpotLen']
print('ERR030856 tissue:\n{}\n\nERR030856 read length:\n{}'.format(
ex_covstats_tiss, ex_covstats_len)
)
ex_covstats_df
In other words, for the 100bp reads run made with a mix of RNA from 16 tissues, using just the first 10,000 reads (trimmed down to ~5,000 reads), about 551 reads mapped to the 20th chromosome of the grch38 build of the H. sapiens genome, covering about 122,717bp of the 64,444,167bp in that chromosome, or about 0.19%.
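The fraction quoted above is just the covered-base count over the chromosome length:

```python
# Coverage fraction for Chr20 from the covstats numbers above
covered_bases = 122_717
chr20_length = 64_444_167
print('{:.2%} of Chr20 covered'.format(covered_bases / chr20_length))
# → 0.19% of Chr20 covered
```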
That's the covstats output. BBMap can also output a bincov stats file, which (according to this wiki) instead breaks the reference down into bins of a given bp length along the chromosome, and reports how many reads mapped to each region specifically.
paths['example_bincov'] = os.path.join(
paths['map_dir1'], 'bincov_NC_000020_11_ERR030856.txt'
)
with open(paths['example_bincov'], 'r') as f:
ex_bincov_df = pd.read_csv(f, sep='\t', header=2)
print(ex_bincov_df['Cov'].describe())
print(ex_bincov_df['Cov'].value_counts().sort_index())
ex_bincov_df.head()
In other words, this file shows, for every kb of Chr20, how much coverage it saw. Since most of the values are fractional, Cov is presumably not a yes/no indicator of whether any reads mapped within that region, but rather a measure of how many of the 1,000bp within it are represented.
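If that guess is right, Cov is mean depth over the bin (aligned bases in the bin divided by bin size), which would explain the fractional values:

```python
# Assumption: Cov = aligned bases in bin / bin size. A single 100bp read
# falling entirely within a 1,000bp bin then gives a mean depth well below 1.
aligned_bases_in_bin = 100
bin_size = 1000
print(aligned_bases_in_bin / bin_size)  # → 0.1
```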
Ok, next is covhist, which looks like it shows pretty much the same info, but the other way around: it shows the number of bp in the whole chromosome that are covered 0, 1, 2, etc. times:
paths['example_covhist'] = os.path.join(paths['map_dir1'], 'covhist_NC_000020_11_ERR030857.txt')
with open(paths['example_covhist'], 'r') as f:
ex_covhist_df = pd.read_csv(f, sep='\t')
plt.bar(ex_covhist_df['#Coverage'], ex_covhist_df['numBases'])
plt.yscale('log')
plt.ylabel('[Log] Number of bases represented')
plt.xlabel('Coverage')
plt.show()
ex_covhist_df
There was also an option to output an additional stats file with the arg basecov=basecov.txt, but it was too big to keep (multiple gigabytes). I'll try producing a single one here, though, and parsing its content:
os.chdir(trim_dirs['3'])
if not os.path.exists('dummy'):
os.mkdir('dummy')
os.chdir('dummy')
shutil.copy2(os.path.join(trim_dirs['3'], 'ERR030856_t.fastq'), 'ERR030856_t.fastq')
kwargs_dict = {'basecov': 'basecov.txt'}
run_bbmap(os.getcwd(), paths['NC_000020_11_fasta'], kwargs_dict=kwargs_dict)
with open('basecov.txt', 'r') as f:
head = [next(f) for x in range(10)]
os.chdir(trim_dirs['3'])
with open('example_basecov.txt', 'w') as f:
for line in head:
f.write(line)
shutil.rmtree('dummy')
for line in head:
print(line)
Ok, so that basecov.txt file is 4.79GB; much larger than the input reference fasta. And it appears that BBMap may generate one such file for every alignment made, which would consume far too much space to be worth it. The output is essentially what the name suggests: for every base in the reference chromosome, it explicitly lists the read coverage in human-readable format (though no human could feasibly page through that many lines to retrieve the coverage they want). The same information should be recoverable from the output .sam file itself, which is far more compact; there's no justification for keeping these stats files from BBMap.
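As a quick demonstration that per-base coverage is recoverable from the SAM file itself, a sketch that accumulates depth from each record's POS and sequence length (it ignores CIGAR indels and clipping, so it's only an approximation; the two alignment lines are invented):

```python
import numpy as np

# Two fake SAM alignment records: QNAME FLAG RNAME POS MAPQ CIGAR ... SEQ ...
sam_lines = [
    'r1\t0\tNC_000020.11\t101\t30\t5M\t*\t0\t0\tACGTA\t*',
    'r2\t0\tNC_000020.11\t103\t30\t5M\t*\t0\t0\tTTGCA\t*',
]
coverage = np.zeros(200, dtype=int)  # toy reference length
for line in sam_lines:
    fields = line.split('\t')
    pos, seq = int(fields[3]), fields[9]
    coverage[pos - 1: pos - 1 + len(seq)] += 1  # POS is 1-based
print(coverage[100:108].tolist())  # → [1, 1, 2, 2, 2, 1, 1, 0]
```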
Finally, there's the sam file itself. In looking for a viewer for that file format, I came across a thread suggesting the use of software called IGV, from the Broad Institute. Unfortunately, that program requires BAM file format for input, which will require me to re-run bbmap:
os.chdir(trim_dirs['3'])
if not os.path.exists('dummy'):
os.mkdir('dummy')
os.chdir('dummy')
shutil.copy2(os.path.join(trim_dirs['3'], 'ERR030856_t.fastq'), 'ERR030856_t.fastq')
kwargs_dict = {'out': 'ERR030856.bam'}
run_bbmap(os.getcwd(), paths['NC_000020_11_fasta'], kwargs_dict=kwargs_dict)
But even that output returns an error when attempting to build an index within the IGV program:
Error: java.lang.IllegalStateException: Records ERR030856.1457 HWI-BRUNOP16X_0001:1:1:13634:1239#0 length=100 (*:0) should come after ERR030856.1458 HWI-BRUNOP16X_0001:1:1:13718:1241#0 length=100 (NC_000020.11:48,839,770) when sorting with htsjdk.samtools.SAMRecordCoordinateComparator
(Sheesh, I had to copy that whole error message manually, since the stupid window that pops up with that doesn't let you select text.) Anyways, it sounds like SAM and BAM outputs have very specific tools for handling them, and specifically that BAM files must be sorted before they can be indexed. I would have assumed that BBMap would handle that automatically, but apparently there can be problems with that process. Specifically, everything relies upon a suite of programs called SAMTools, but apparently it's not built for Windows. To get the functionality to sort BAM files on Windows, I'll try BOW.
Hmm... No, BOW contains BWA, which is the Burrows-Wheeler Aligner, which is an (older) alternative algorithm for mapping reads in the first place; it should output proper BAM files, but that would obviate all the advantages of running BBMap. But I can't find the source code for SamTools within the BOW archive. I searched for SamTools on my PC, and it did find some dirs within the source code for MultiQC, but the scripts within those folders look like they're designed to call some of the functionality of SamTools, but maybe not recapitulate it directly.
On the other hand, maybe it's worth trying to install samtools on windows anyway. I extracted to C:\Users\DMacKellar\Documents\Python\BioPython\samtools-1.8, and will have to try cmake on it.
I used the bash shell to make samtools on the PC, so the executables appear to be unavailable under the native windows environment. It still took several hours to work the kinks out to get it installed. Now I'll try to use it to sort the bam file.
Well, samtools returns utterly unhelpful errors when I try samtools sort NC...:
samtools sort NC_000020_11_ERR030888.sam
hbr^?" ݸ՛ÓAchI!a# R譊)X6Cu-r\hKhmj:y㸍%$fG)$IJ݂uH]iIHD$jM<DT~$a#Ԃef@-:U͕[?zl)캍{Wº#΅T~[[A!ؽ~yj BC x^UXuny֖9S*ioIU~t\ 1H:v'|]GS!Z{Ȯɽޫ;Ozh:ZVi=[xٟ5h4?*,f壓tb/xor9oh0BHciwJp>J-oޫ>v{ŧz-ߘ'F[[\N'Gdzxx7
[E::bgzf_flush] File write failed (wrong size)
hbr^?" ݸ՛ÓAchI!a# R譊)X6Cu-r\hKhmj:y㸍%$fG)$IJ݂uH]iIHD$jM<DT~$a#Ԃef@-:U͕[?zl)캍{Wº#΅T~[[A!ؽ~yj BC x^UXuny֖9S*ioIU~t\ 1H:v'|]GS!Z{Ȯɽޫ;Ozh:ZVi=[xٟ5h4?*,f壓tb/xor9oh0BHciwJp>J-oޫ>v{ŧz-ߘ'F[[\N'Gdzxx7
[E::bgzf_close] File write failed
samtools sort: failed to create "-": Input/output error
So I figured that maybe the sam file was corrupted or improperly formatted or something, but when I open it in Notepad++ on the PC the file appears to fit the specification. So I figured maybe samtools was somehow not installing properly when using bash, so I spent HOURS trying to get it to compile with various C++ compilers on Windows: Visual Studio C++, MinGW, Mingw-64, etc.
Now I'm not entirely certain how compiling on Windows is supposed to go; you can (in the case of Visual Studio C++) do 'cl', which calls the compiler, then just specify a bunch of filenames to compile, i.e., 'cl blah.cpp blah2.cpp', but there are so many files in this package that that approach seems stupid. The more usual way to compile on UNIX is to do:
./configure
make
make install
But trying cl configure, cl configure.ac, or cl ./configure.ac keeps returning an error that says:
configure.ac: file not recognized: File format not recognized
So, with MinGW, I tried all the various compilers available: c++, gcc, g++, etc. They all say the same damn thing.
Hours later, I found out there's a wrapper for the basic SAM/BAM functionality in Python, called pysam. But apparently it doesn't support Windows, either: attempting pip install pysam yields:
File "C:\Users\DMACKE~2\AppData\Local\Temp\pip-install-la_s7snp\pysam\setup.py", line 69, in run_make_print_config
stdout = subprocess.check_output(["make", "-s", "print-config"])
And, indeed, make isn't a valid command on Windows. I tried changing that line to cmake, c++, etc., but to no avail.
Anyways, eventually I went back to the bash-specific samtools install, and tried changing the commands, according to this workflow. Finally I found out that running:
samtools sort NC...bam -o NC..._1.bam
will work. Apparently samtools sort streams the sorted BAM to stdout unless you specify an output file with -o; it won't perform an in-place sort, and presumably all that binary garbage above was the BAM stream being dumped to the terminal. So really, I've lost a couple of days of productivity because whoever built samtools couldn't be bothered to have it return any kind of inline error message that gave a clue as to its preferred input arg formatting. I should have read more documentation/examples before going down this rabbit hole, but given how poorly and minimally supported bioinformatics software is, especially on Windows, I had no real reason to expect that the thing wasn't fundamentally broken.
Anyways, I think I've now more or less figured out a workflow that should work on the PC for getting these stupid files sorted so I can visualize the outputs of the mappings.
bash = 'bash'
samtools = '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/samtools-1.8/samtools'
map_dir = '/mnt/c/Users/DMacKellar/Documents/Data/Bio/Bmap/split3/map2018-07-02T20_43_53'
def sort_sam(input_dir):
os.chdir(input_dir)
sam_files = []
for file in os.listdir(input_dir):
if os.path.splitext(file)[-1] == '.sam':
sam_files.append(file)
if o_s == 'Windows':
samtools = '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/samtools-1.8/samtools'
sort_cmds = ['bash']
index_cmds = ['bash']
for sam in sam_files[::10]:
# subprocess can't handle arbitrarily long cmd line strings
fname = os.path.splitext(os.path.basename(sam))[0]
# print(fname)
sort_cmds.append('{0} sort {1} -o {2}.bam'.format(samtools, sam, fname))
index_cmds.append('{0} index {1}.bam'.format(samtools, fname))
final_sort_cmds = ';'.join(sort_cmds)
final_index_cmds = ';'.join(index_cmds)
p1 = subprocess.Popen(final_sort_cmds, stdout=subprocess.PIPE,
stderr=subprocess.PIPE, shell=True)
out, err = p1.communicate()
if p1.returncode != 0:
print('\n returned an error: \n{}\n'.format(
err.decode('ascii')))
else:
print('Sam files sorted.')
p2 = subprocess.Popen(final_index_cmds, stdout=subprocess.PIPE,
stderr=subprocess.PIPE, shell=True)
out, err = p2.communicate()
if p2.returncode != 0:
print('\n returned an error: \n{}\n'.format(
err.decode('ascii')))
else:
print('Bam files indexed.')
# report = os.path.join(in_dir, 'multiqc_report.html')
# webbrowser.open(report)
map_dir1 = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53'
sort_sam(map_dir1)
Ok, that's working, but the IGV viewer isn't really showing anything for the sorted BAM output files. Maybe that's because I used so few reads, and it expects much higher coverage?
Yeah, that's about it. I have to zoom in to like 75% of the way towards the most zoomed in that I can get to see anything. Anyway, the reads are there. That fine-grained a view of the output of the mapping isn't really that suitable for handling so many different reads and templates as this, however. It's time to move on with stats to try to quantify differential gene expression.
Given that my laptop's RAM can't even hold the BBMap index for many individual human chromosomes, it's clear that the scale of this analysis exceeds local computational resources. To get more memory and processing power to handle more of the BodyMap data, I could of course turn to popular, inexpensive fee-based services like AWS or Google Cloud Compute. An alternative that makes significant computational resources available for free, and that is geared specifically towards bioinformatics data analysis, is Galaxy. Finally, there are some videos available, derived from an ongoing series of conferences focused on Galaxy.
Galaxy is meant as an approachable, GUI-based platform for non-specialists in computation, and it would be somewhat tedious to reproduce many of the steps I've carried out in this notebook with their manual browser interface. Fortunately, they also allow programmatic access via an API, and, equally happily, someone has developed Python bindings for that API in a package called BioBlend. Their resources aren't unlimited, however, so they offer another interface called CloudMan for especially large jobs, which sounds like a Galaxy-ecosystem wrapper that actually outsources the work to AWS.
I'm having difficulty finding hard limits on the resources made available for free in the Galaxy Main instance, and those that would necessitate additional (fee-based) power. I suppose that I can begin by attempting some data transfer to, and running various-sized jobs on, a Galaxy Main instance, and seeing if they provide any error messages or personalized guidance if I exceed my welcome as a free user. This of course requires an API key.
Actually, wait: I finally found an outline of limits they expect you to obey.
galaxy_api = credentials['galaxy']['api_key']
from bioblend import galaxy
gi = galaxy.GalaxyInstance(url='https://usegalaxy.org/', key=galaxy_api)
gi.config.get_version()
Now to start organizing the workspace and downloading datasets. Additional examples of the syntax can be found here. There's also a series of short Jupyter Notebooks showing some of its functionality, apparently produced by one of the developers of BioBlend, here.
Note that Galaxy (as run from within the browser) appears to have the option to get data directly from NCBI SRA, but not from NCBI Nucleotide. It does, however, have the ability to access data directly from the UCSC genome browser, which does have access to the Human Genome GRCh38 build, which I've been using so far in this notebook, so the outputs should be comparable. Unfortunately, figuring out which tracks contain the raw nucleotide sequence of the genome is proving difficult; I'll have to find another source for the reference to which to align the reads.
Furthermore, it looks like the default organization of data within the BioBlend API is to generate/return Libraries of data, which each contain Folders and can have different permissions set.
The organization of the Galaxy Instance returned by BioBlend is a little complex. The Instance returned doesn't contain all of the data available, just the necessary credentials to access it by querying the (distant) Galaxy server, so subsequent querying of the instance and its available data can take significant amounts of time. The default Galaxy Instance returned from BioBlend has several nested classes: folders, genomes, histories, jobs, libraries, workflows. It has a quotas class, too, but that appears not to show you the resources associated with the instance unless you have admin privileges. The documentation on these classes isn't great, so I've mostly found things out by iteratively running the built-in Python dir() function on each of them (e.g., dir(gi.libraries)).
It looks like most of the built-in data are contained in libraries and genomes, which are independent. By scanning through a list of the libraries, you can find out that the Illumina BodyMap data are already uploaded to & accessible to a Galaxy Instance by default:
libs = gi.libraries.get_libraries()
for i, lib in enumerate(libs):
if 'Illumina' in lib['name']:
print('{:>2} {} {:<30} {}'.format(i, lib['id'], lib['name'], lib['description']))
gi_bmap_lib = libs[25]['id']
Within each library are one or more folders. You can retrieve the ids and names for each folder with the gi.libraries.get_folders() method, but you can't actually see the data within a folder that way. Feeding one of those ids to gi.libraries.show_folder() doesn't return any more info, either; it just returns the same dict output by get_folders(), in isolation. To get more, you have to back out of the gi.libraries subclass and go into the gi.folders subclass: gi.folders.show_folder() takes an individual folder id, and defaults to showing the same data as gi.libraries.show_folder(), unless you pass the additional arg contents=True.
This will output a dict. One key, value pair of that dict will be the key 'metadata', whose value is another dict. The other key, value pair output by calling gi.folders.show_folder(folder_id, contents=True) will be the key 'folder_contents', whose value is a list of dicts. But things get trickier. For each dict in the list of dicts corresponding to the key 'folder_contents', the dict will have a key called 'type', and that type can be either 'folder' or 'file'. If the type is 'folder', it will also list the separate ID of that folder. If you call gi.folders.show_folder() on these nested folders, you can see whether they have yet more folders inside of them, or just files.
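The folder/file recursion just described can be wrapped in a small helper. This sketch runs against a mocked response rather than a live instance; the only BioBlend call it assumes is gi.folders.show_folder(folder_id, contents=True), for which the lambda below stands in:

```python
# Recursively collect file ids from the nested folder structure described
# above. The mock_api dicts mimic show_folder(id, contents=True) responses.
mock_api = {
    'F1': {'folder_contents': [
        {'type': 'folder', 'id': 'F2', 'name': 'sub'},
        {'type': 'file', 'id': 'D1', 'name': 'reads_1.fastq'},
    ]},
    'F2': {'folder_contents': [
        {'type': 'file', 'id': 'D2', 'name': 'reads_2.fastq'},
    ]},
}

def collect_files(folder_id, show_folder):
    files = {}
    for item in show_folder(folder_id)['folder_contents']:
        if item['type'] == 'folder':
            files.update(collect_files(item['id'], show_folder))
        elif item['type'] == 'file':
            files[item['name']] = item['id']
    return files

print(collect_files('F1', lambda fid: mock_api[fid]))
```

Against a real instance, the lambda would be replaced by something like `lambda fid: gi.folders.show_folder(fid, contents=True)`.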
gi_bmap_folder = gi.libraries.get_folders(gi_bmap_lib)
gi_bmap_folder
folders = gi.folders.show_folder(gi_bmap_folder[0]['id'], contents=True)
folders
gi_bmap_files = {}
for folder in folders['folder_contents']:
gi_bmap_files[folder['name']] = folder['id']
print(len(gi_bmap_files))
If you get deep enough into this nested folder hierarchy to find files, you can use the gi.folders.show_folder() method to retrieve the specific ID associated with each individual file. That ID is the proper string to identify a 'dataset', even though there's nothing in the BioBlend code or docs to really suggest that the two overlap. You can then get metadata about the file by calling gi.datasets.show_dataset() with that file id and the arg hda_ldda='ldda', and presumably access the data locally by passing the same id to gi.datasets.download_dataset().
Ok, so now I know that the native data for the Illumina BodyMap project are available to Galaxy; now I need to know how to use them. It appears that nothing gets done computationally until you set up a workflow of steps and then call gi.workflows.invoke_workflow() on the Galaxy instance. What's unclear to me is whether this defaults to running locally (i.e., on my PC, which wouldn't accomplish anything I couldn't do on my own, with far fewer complications), or whether it schedules the job to run on the Galaxy Main server. It would be a good idea to try out a sample task and submit it.
Ok, more progress: I figured out that using Galaxy is very reliant on the "History" concept: it was largely intended to help standardize and make more reproducible NGS workflows, after all. So you want to go into the current Galaxy Instance's histories class, and return the custom ID that they've assigned for keeping track of your session using the gi.histories.get_current_history() method, then pass that, along with the specific file ID you determined using the libraries and folders classes above, and it'll copy those data (or perhaps an indexer to those data as instantiated on the main Galaxy server) to your Galaxy account; it shows up in the History tab of the GUI in the browser.
With the history_id and bodymap dataset_ids in hand, you can copy the dataset to your history with the command gi.histories.upload_dataset_from_library(), passing in the two ids.
Now, given that the quota guidelines for individual users' histories on Galaxy Main cap storage space at 200G, and given the large size of the Illumina BodyMap datasets:
sizes = []
for folder in folders['folder_contents']:
sizes.append(float(folder['file_size'].rstrip(' GB')))
print(sum(sizes))
We won't be able to process all of the runs in a single history. In fact, given that qc and trimming of each dataset is likely to increase the space required for it by at least 2 or 3 fold, it seems prudent to only process a handful of runs per history. For now, I'll proceed with completing the pipeline of a single experiment as a trial, and, once a reasonable workflow has been established, scale up.
# # DCM Note: comment out after running once
# gi_hist = gi.histories.get_current_history()['id']
# for k, v in gi_bmap_files.items():
# if 'thyroid' in k:
# gi.histories.upload_dataset_from_library(gi_hist, v)
Once you have sent a dataset from a library to your active history, you can see all of the datasets in your history with gi.histories.show_history(gi_hist); this returns a list of arbitrary dataset ids, along with their status ('ok', 'running', etc.), but no additional info to make sense of which dataset holds which contents. To get that, pass the additional arg contents=True, and it'll also include dataset names in its list of dicts; with the right identifying fields, you can parse this and save the IDs to a Python variable for easier future access.
gi_hist_datasets = {}
for ds in gi.histories.show_history(gi_hist, contents=True):
if ds['history_content_type'] == 'dataset' and ds['deleted'] == False:
gi_hist_datasets[ds['name']] = ds['id']
for k, v in gi_hist_datasets.items():
print('{}: {}'.format(v, k))
sizes = []
for k, ds in gi_hist_datasets.items():
if 'thyroid' in k:
d = gi.histories.show_dataset(gi_hist, ds)
size = float(d['misc_blurb'].rstrip(' Gb'))
sizes.append(size)
print(sizes)
Unfortunately, I can't find a way to access anything like a unique dataset ID for the GRCh38 genome under the gi.genomes class to send it to the history as well. And the UCSC Browser-based copy of that genome that I tried to copy into Galaxy manually appears to be more of a gene list than the raw genome; the tracks feature within that tool, for selecting what info to retain when downloading, really didn't make it clear what to choose to get just the raw nucleotide sequence. I'll try instead sending from NCBI, using gi.genomes.install_genome().
Nope; I still couldn't find a unique NCBI accession number to have Galaxy download it directly. The in-line help for gi.genomes.install_genome() says that the 'ncbi_name' arg expects 'NCBI's genome identifier', suggesting that the accession should come specifically from the 'Genome' database within the NCBI site, and I can't find any identifier there that would accord with GRCh38. Furthermore, I expect that for this project I'll want to align not just against the raw genome assembly sequence, but against the predicted human transcriptome. That can be difficult if your organism is obscure; you may have to build the putative transcriptome yourself from a genome assembly and lots of RNAseq reads before performing gene quantification with the RNAseq data. But in this case, we're dealing with a well-covered and important organism, whose transcriptome has been derived from many data sources and built by very well-informed researchers; I doubt I'm going to learn anything new that they haven't. So I'll just download the relevant RNA transcriptome predicted from the GRCh38.p12 assembly.
I found a unique page on NCBI corresponding to that patch, and following links to the annotation report, you can find a link to the ftp site. Reading through the README file in the base dir suggests that the relevant data that I'm looking for are in the /RNA subdir, and by downloading both and checking, I find that the specific file of most utility is probably not the 125MB 'Gnomon_mRNA.fsa.gz', but rather the 70MB rna.fa.gz file.
A note about the transcriptome: the downloaded FASTA file has 159,998 instances of the '>' character, which should indicate that many transcripts are present in the build. This is very close to the sum (160,474) of the mRNAs (113,620) and non-coding transcripts (46,854) listed in the NCBI annotation report for GRCh38.p12. That same report breaks down the transcriptome's content as possessing 54,644 'genes and pseudogenes', including 20,203 protein-coding genes, 17,871 non-coding genes, and 20,110 genes with variants, meaning nearly every protein-coding gene comes with splice variants.
Unfortunately, it doesn't seem like there's a way to transfer this directly from NCBI to Galaxy, probably because the only method that might suffice, gi.tools.upload_from_ftp(), doesn't really provide documentation on how to pass in credentials to satisfy NCBI. Trying the method with the proper FTP URL and history ID, with no additional params, returns: Unexpected HTTP status code: 400.
Instead, it seems to be prudent to download the transcriptome file to my local machine, then upload it to Galaxy with gi.tools.upload_file():
#!/usr/bin/python
# DCM Note: code modified from http://rizwanansari.net/download-all-files-from-ftp-in-python/
import ftplib
import time
kwargs = {'server': 'ftp.ncbi.nlm.nih.gov',
'user': 'anonymous', 'password': credentials['ncbi']['email'],
'path': '/genomes/Homo_sapiens/RNA/', 'fname': None,
'destination': paths['data_dir']}
def download_ftp(**kwargs):
    """Download a single named file from an FTP server into a local dir."""
    server = kwargs['server']
    user = kwargs['user']
    password = kwargs['password']
    path = kwargs['path']
    fname = kwargs['fname']
    destination = kwargs['destination']
    interval = 0.05  # brief pause before the transfer
    ftp = ftplib.FTP(server)
    ftp.login(user, password)
    try:
        ftp.cwd(path)
    except ftplib.error_perm:
        print("Error: could not change to " + path)
        sys.exit("Ending Application")
    for file in ftp.nlst():
        if file != fname:
            continue
        time.sleep(interval)
        try:
            # use a context manager so the local file is always closed
            with open(os.path.join(destination, file), "wb") as out:
                ftp.retrbinary("RETR " + file, out.write)
            print("Downloaded: " + file)
        except ftplib.all_errors as e:
            print("Error: file could not be downloaded: {} ({})".format(file, e))
    ftp.quit()
kwargs = {'server': 'ftp.ncbi.nlm.nih.gov', 'path': '/genomes/Homo_sapiens/RNA/', 'password': credentials['ncbi']['email'],
'user': 'anonymous', 'destination': paths['data_dir'], 'fname': 'rna.fa.gz'}
download_ftp(**kwargs)
gi_hist = gi.histories.get_current_history()['id']
local_transcriptome = os.path.join(paths['data_dir'], 'rna.fa.gz')
gi.tools.upload_file(local_transcriptome, gi_hist)
Ok! I now have the Illumina BodyMap Thyroid datasets and the GRCh38.p12 transcriptome uploaded to my Galaxy Instance's history. I'm now ready to run the reads through FastQC and determine their quality. Now, the workflow class of BioBlend is pretty complex, and not really meant to be written by hand. Rather, to expedite the process, I'll invoke the job through the browser interface to Galaxy, then export the Workflow/history, and copy it here to allow it to be run programmatically, as well.
FastQC and MultiQC are available by default to Galaxy Main; I'll use both to check the Thyroid reads.
Unfortunately, the jobs take a significant amount of time to run on the Galaxy Main server, so the results will not be available immediately. Instead, the status of the job can be monitored with gi.histories.get_status():
gi.histories.get_status(gi_hist)
It should be possible, from this, to write some kind of function that monitors your Galaxy instance's history status and initiates subsequent jobs automatically when previous steps complete, but that functionality would only be of use while I'm going through the process of building up the workflow. Once a viable workflow has been executed and exported, it should be possible to initiate it in a couple of steps, passing the workflow script to the Galaxy Main server and trying gi.workflows.invoke_workflow().
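As a sketch of what that monitor could look like (the function and its argument names are my own, not part of BioBlend): making the poller take a zero-argument callable keeps it testable without a live Galaxy connection; in practice you might pass something like lambda: gi.histories.get_status(gi_hist)['state'].

```python
import time

def wait_for_history(get_state, interval=30, timeout=3600):
    """Poll a state-returning callable until it reports a terminal
    Galaxy-style state ('ok' or 'error'), then return that state.

    get_state: zero-arg callable returning a state string, e.g.
    lambda: gi.histories.get_status(gi_hist)['state'].
    Raises TimeoutError if no terminal state appears within `timeout` s."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in ('ok', 'error'):
            return state
        time.sleep(interval)
    raise TimeoutError('history did not finish within {}s'.format(timeout))
```

A follow-up job could then be launched whenever this returns 'ok'.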
Anyways, once the first part is complete, I'll download the workflow.
thyroid_fastqc_workflow_id = gi.workflows.get_workflows()[0]['id']
thyroid_fastqc_workflow = gi.workflows.export_workflow_dict(thyroid_fastqc_workflow_id)
type(thyroid_fastqc_workflow)
thyroid_fastqc_workflow.keys()
steps = []
for k, v in thyroid_fastqc_workflow['steps'].items():
    steps.append(v)
for step in steps:
    print(step['name'])
steps[-3:]
That should give a general impression of the number of specific inputs required to specify a simple task like running FastQC on three files. It's not something I want to reproduce manually. Instead, I'll continue with the workflow via the GUI, then export the whole thing at the end.
In the meantime, let's show some plots here, to summarize the thyroid reads' quality:
thyroid_multiqc_plots_dir = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\galaxy\thyroid_multiqc_1_files'
thyroid_pngs = []
for file in os.listdir(thyroid_multiqc_plots_dir):
    if file[:7] == 'fastqc_':
        print(file)
        thyroid_pngs.append(os.path.join(thyroid_multiqc_plots_dir, file))
from IPython.display import Image
Image(thyroid_pngs[2])
Image(thyroid_pngs[4])
So, surprisingly, the data don't look that bad for the thyroid reads, at least when taken as a whole. Contrast this to the lousy results seen when just looking at the first 10,000 reads from any dataset when I was verifying their quality on my local machine. Perhaps the first few spots included in any run are more likely to be skewed to a lower quality when compared to the entirety of the experiment?
Since my main objective with this analysis is to quantify gene expression across these tissues, I'll also want to make sure that we do see some transcripts more than once. We might not expect the same portion of a transcript to show up in many reads, so sequence duplication levels might not be the most important metric to establish this now, but it would still be good to confirm that there is some amount of sequence duplication present in these reads:
Image(thyroid_pngs[5])
In any case, even though they look mostly fine, it's still clear that the first dozen or so bases do show the problem noted before: heavy bias in the identity of different nucleotides (unfortunately, MultiQC says that this type of report plot doesn't export as a png on its own, so I had to use printscreen, and thus the quality takes a hit):
per_seq_qual = os.path.join(thyroid_multiqc_plots_dir, 'per_base_seq_grab.png')
Image(per_seq_qual)
This means that I'll definitely need to trim. Unfortunately, the BBTools suite doesn't appear to be available on Galaxy by default:
gi.jobs.get_jobs()
tools = gi.tools.get_tools()
len(tools)
tools[0]
tool_descriptions = {}
for tool in tools:
    tool_descriptions[tool['name']] = tool['description']
tool_descriptions_S = pd.Series(list(tool_descriptions.values()))
tool_descriptions_S.unique().shape
for x in tool_descriptions.keys():
    if 'map' in x:
        print(x)
Rather, their main tool for editing reads is Trimmomatic.
Actually, there are alternative, simpler tools that can be run in series, which are based on the FastX-Toolkit. The author of that package even includes docs on how to use it in the context of Galaxy. These simpler steps include trimming by length, trimming by quality, filtering by read quality, clipping adapter sequences, removing 'sequencing artifacts', etc. I haven't used this suite of tools before, but am intrigued by having these steps split out explicitly, so I'll try it with the thyroid reads:
The options available in the FastX-Tool for trimming by length include specifying a first base to keep; it may be too aggressive, but I'm going to chop the first dozen bases from every read off. Then I'll use more gentle filtering by read quality (90% of bases within a read must have PHRED>20), clipping adapters, and trying that 'sequencing artifact' removal step.
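The quality-filter criterion above (90% of bases within a read at PHRED ≥ 20) is simple enough to express directly. A sketch mimicking my reading of the FASTX fastq_quality_filter flags (-q 20 -p 90); the helper names are mine, and I'm treating -q as a minimum (inclusive) threshold:

```python
def phred_from_ascii(qual_string, offset=33):
    """Decode a FASTQ quality string (Sanger/Illumina 1.8+, offset 33)
    into a list of per-base PHRED scores."""
    return [ord(c) - offset for c in qual_string]

def passes_quality_filter(quals, min_q=20, min_pct=90):
    """Keep a read only if at least min_pct percent of its bases have
    PHRED quality >= min_q (the fastq_quality_filter criterion)."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return 100.0 * good / len(quals) >= min_pct
```

For example, a 10-base read with nine bases at Q30 and one at Q10 squeaks past the 90% cutoff, while two low-quality bases would fail it.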
Actually, after scheduling these steps, I found that the adapter-clipping step had errored out; I wasn't sure how to choose the options for that tool, and apparently failed to supply an input adapter sequence to target. The tool's window has a dropdown menu titled 'Source', with only two options: 'Standard (select from the list below)' and 'Enter custom sequence'. There's an empty field below it, titled 'Choose adapter', that's clearly meant to receive input related to this choice. Rather than feeding it a single adapter sequence, I had gambled that the first option, 'Standard', would pass a default library of adapter sequences, likely including the common Illumina mRNA-seq kits, and that the '(select from the list below)' part of that line was perhaps misplaced, referring instead to the dropdown containing it, since there doesn't seem to be any other such 'list' anywhere in the tool's form page.
Instead, the tool had apparently expected me to input a single adapter sequence, because the tool stalled with an error (I'm having difficulty, one day later, tracking down the exact string returned by that error, because I'm unsure of the particular BioBlend command syntax to retrieve it). The man page for the CLI-based version of FastX-Tools indicates that the clipper tool expects a single custom sequence to be input at runtime:
[-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
Anyways, since subsequent steps were dependent upon the output of that tool in order to execute, the rest of the quality control steps stalled, and as a result the run didn't finish overnight, but needs to be re-run. That may be an advantage of substituting a tool like Trimmomatic, which executes as a single pipeline, rather than the separate steps of FastX-Tools. I had checked the Galaxy Main server from my phone after quitting last night, and had seen that the jobs had stalled, but I couldn't supply an adapter easily in that format, so I had tried just deleting the stalled step and the subsequent steps, then calling just the follow-up steps, disregarding the adapter clipping. But I was unable to figure out how to resume the workflow over my phone at the time. In fact, when I checked it again today on my PC I still couldn't find a way to resume it. The workflow steps were highlighted in light blue, saying something like 'this job is paused; select "Resume Paused Jobs" under "History Options" to restart it', but following those instructions had no effect on the workflow. I tried looking for guidance on that particular issue, but it sounds like this may be an acknowledged bug, and my workaround was the most expedient. Too bad.
Anyways, I'm skipping this adapter clipping step for now; I'm unsure whether it may be a legitimate concern... Actually, I just reviewed the MultiQC report on the raw thyroid reads, and it does include a plot called 'Adapter Content', which indicates that all three Thyroid read files passed whatever threshold FastQC applies for that criterion, but the plot does indicate that adapter content increases towards the end of reads up to a maximum of 0.17% of content towards the 3' end of the SE run (around base 64). That's not too bad, but if I can get rid of it it's worth adding one more step to the pipeline. The trace indicates that the identity of the adapter it has detected is the Illumina Universal Adapter. I consulted Google to get the sequence corresponding to this adapter, and found this page. Although it lists the Universal Adapter under two different contexts (the TruSeq and TruSight kits), they appear to have the exact same sequence. So I entered that into the FastX-Tools adapter clip step and tacked that step on to the end of the workflows to modify the Thyroid read files.
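The 3'-end adapter clipping described above can be illustrated with a naive sketch. This is my own toy implementation, not the fastx_clipper algorithm (which handles mismatches and more), and the AGATCGGAAGAGC sequence in the example is the commonly cited shared prefix of the Illumina TruSeq adapters, used here as an assumption:

```python
def clip_adapter(read, adapter, min_overlap=5):
    """Naive 3' adapter clipping: trim the read at the first full
    internal adapter match, or where a prefix of the adapter (at least
    min_overlap bases) runs off the 3' end of the read."""
    # full internal match
    i = read.find(adapter)
    if i != -1:
        return read[:i]
    # partial adapter overhanging the 3' end
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read
```

This illustrates why adapter content rises toward the 3' end of reads: inserts shorter than the read length put the adapter at the tail, sometimes only partially.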
Those steps are taking a while to run, especially for the SE run (up to 65bp in length, compared to 50bp each for the PE files, although the input file sizes were about the same). If I check the job history, either through the BioBlend API or the Browser interface, it returns a 'Created' timestamp for the dataset, in UTC, and Google says that 'Coordinated Universal Time is 7 hours ahead of Pacific Time'. This jibes with my memory in the case of the first trim steps, which I initiated around 8:30PM on 20180710; the timestamp says 'Wed 11 Jul 2018 03:24:49 AM (UTC)'. The earliest QC steps today (20180711) were started at ~5:45PM, and are still running two hours later.
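The timestamp arithmetic checks out. A quick sketch with the standard library (I hard-code the UTC-7 offset for July's Pacific Daylight Time here; a general-purpose converter should use a tz database like zoneinfo instead):

```python
from datetime import datetime, timedelta, timezone

# Galaxy reports 'Created' timestamps in UTC; in July, Pacific Daylight
# Time is UTC-7 (fixed offset here, for illustration only).
pdt = timezone(timedelta(hours=-7), name='PDT')
created_utc = datetime(2018, 7, 11, 3, 24, 49, tzinfo=timezone.utc)
created_local = created_utc.astimezone(pdt)
print(created_local.strftime('%Y-%m-%d %I:%M:%S %p %Z'))
```

'Wed 11 Jul 2018 03:24:49 AM (UTC)' converts to 8:24:49 PM on July 10, matching the ~8:30 PM start I remembered.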
gi_hist = gi.histories.get_current_history()['id']
gi.histories.get_status(gi_hist)
# gi.jobs.get_state()
Ok, let's check the plots from MultiQC after trimming:
if o_s == 'Darwin':
    thyroid_multiqc_plots_dir_2 = '/Users/drew/data/Bio/galaxy/thyroid_multiqc_2_files'
thyroid_pngs_2 = []
for file in os.listdir(thyroid_multiqc_plots_dir_2):
    if file[:7] == 'fastqc_':
        print(file)
        thyroid_pngs_2.append(os.path.join(thyroid_multiqc_plots_dir_2, file))
from IPython.display import Image
Image(thyroid_pngs_2[1])
Image(thyroid_pngs_2[2])
Image(thyroid_pngs_2[0])
Image(thyroid_pngs_2[-1])
Those all look pretty good. It's interesting that the representation of sequences that have duplicates actually seems to have increased (i.e., check out how the peak over the '>10' bin has gotten closer to the '25%' tickmark on the y axis). Presumably, this occurred at the quality filtering step; trimming the reads shouldn't have increased this score. It must be that some of the unique reads were of particularly low quality. This step now actually fails MultiQC's automated cutoff for quality, but I imagine that, if anything, it will make the quantification step more interesting.
As Conesa et al. 2016 points out, when mapping against a transcriptome (as opposed to a genome), you can use an ungapped mapper, and Bowtie is the most popular choice, historically. If you're aligning against the genome, you have to take into account the mismatch between the intron-containing nature of genomic sequences and the intron-absent (and alternatively spliced) nature of the cDNA from which the reads were amplified; in this case the most popular choice historically has been TopHat.
But furthermore, when mapping reads which represent such short segments of transcriptomes that possess splice variants, the result will be that many, many of the reads will map to multiple different transcripts in the transcriptome equally well, which complicates quantification of gene expression from RNAseq data.
The most straightforward, naive approach would simply be to accept all multi-mapped reads, and report how many reads aligned to every single transcript or gene in the reference sequence. But even with this most naive approach, another complication comes from the fact that genes come in different lengths, and read libraries from which counts are drawn vary in size. The difficulty with the latter is obvious (total counts will be higher whenever you have more reads), but the problem with the former is simply that longer genes will, for any given copy number, represent a larger number of total nucleotides in the input cDNA library, and therefore be biased towards being represented in more reads than shorter genes. To make comparisons across different sequencing experiments possible, practitioners have introduced a normalizing parameter, 'reads per kilobase of exon model per million reads', or RPKM. Further, they state that two derivatives of this measure, 'FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKMs, and TPM (transcripts per million) are the most frequently reported RNA-seq gene expression values'. Finally, and obviously, there are biases in the amplification steps involved in any sequencing protocol, especially in GC content, but also in repetitive sequences or other structural considerations of the template DNA.
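The RPKM and TPM definitions above amount to just a few lines of arithmetic. A minimal pure-Python sketch (the function names are mine):

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads:
    divide each count by transcript length (kb) and by the library
    size in millions."""
    per_million = sum(counts) / 1e6
    return [c / (l / 1e3) / per_million for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale
    so the values sum to one million within the sample."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]
```

The difference is only the order of operations, but it gives TPM the convenient property that every sample's values sum to exactly one million.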
As to the proper tools that were built with these issues in mind, I'll just copy verbatim the Conesa_2016 suggestion:
'Algorithms that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization) [40], eXpress [41], Sailfish [35] and kallisto [42] among others. These methods allocate multi-mapping reads among transcripts and output within-sample normalized values corrected for sequencing biases [35, 41, 43]. Additionally, the RSEM algorithm uses an expectation maximization approach that returns TPM values [40]. NURD [44] provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost.'
Let's check for which of these are already available to the Galaxy Instance, to see what options I'll have going forward:
looking_for = ['RSEM', 'eXpress', 'RNA-Seq by Expectation Maximization', 'Sailfish', 'Kallisto', 'Salmon', 'NURD', 'Bowtie', 'Tophat', 'Cufflinks']
for x in set(tool_descriptions.keys()):
    for looked in looking_for:
        if looked in x:
            print(x)
I should note that Galaxy also separates the substantial number of tools (1,626 as of 20180711) already compiled and available to run immediately from a stable of others (751 as of 20180711) that must be installed prior to running, via the 'toolshed':
toolshed = {}
for tool in gi.toolShed.get_repositories():
    toolshed[tool['name']] = tool['id']
print(len(toolshed))
for x in set(toolshed.keys()):
    for looked in looking_for:
        if looked in x:
            print(x)
list(toolshed.keys())[:20]
Apparently, though, none of those missing from the tools list are present in the toolshed either.
Additionally, Galaxy itself has a pretty good overview of the ins and outs of the RNAseq pipeline, and they say that RPKM, FPKM, and TPM are not suitable for comparison of expression levels across samples; they're relative measures. This seems to conflict with my reading of Conesa 2016. The Galaxy article links to this blog post for further clarification.
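A tiny worked example of why TPM is a relative, within-sample measure: because the values are forced to sum to one million per sample, doubling one gene's output deflates every other gene's TPM even though their absolute expression is unchanged. All numbers below are made up for illustration:

```python
def tpm_simple(counts, lengths_kb):
    """TPM from raw counts and transcript lengths in kilobases."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Two hypothetical samples, identical except that gene A's count doubles.
lengths = [1.0, 1.0, 1.0]                       # all 1 kb, for simplicity
sample1 = tpm_simple([100, 100, 100], lengths)  # A, B, C
sample2 = tpm_simple([200, 100, 100], lengths)
# Genes B and C are expressed identically in both samples, yet their
# TPMs drop from ~333,333 to 250,000, because TPM always sums to 1e6.
```

So equal TPMs across samples don't imply equal absolute expression, which is the Galaxy article's point.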
All right, I want to set up a first mapping/quantification run, and I think I'll try Sailfish first, since it claims to be pretty fast. The Galaxy interface for that tool asks first if you want to use a standard transcriptome/index or supply your own. I'm doing the latter, with the NCBI version of the GRCh38.p12 rna.fa.gz that I uploaded. Next, they ask for 'The size of the k-mer on which the index is built'; the default in Galaxy's interface is 21, but the Sailfish FAQ says that their CLI-based default k-mer size is 31. They also say that it doesn't appear to be that sensitive a parameter, however, so I'll stick with 21 for the first pass. The NCBI Annotation Report doesn't appear to supply any guidance on whether this was a relevant parameter when building the transcriptome and, if so, what value they used.
They also ask, per run, whether the library is SE or PE reads; they don't have an option addressing both, so I guess I'll have to set up separate runs for the two types. In this history, the fully-trimmed SE file is dataset #53: clip on data 50. The paired end runs are datasets #51 & 52.
But a little further down, there's a parameter called 'File containing a mapping of transcripts to genes', which says it's looking for a GTF file, used for calculating 'Gene-level abundance estimations'. This clearly addresses the point I brought up earlier about the distinction between counting the reads that map to a specific transcript versus to the concept of a gene more generally in a genome. The annotation report doesn't mention this, but in the FTP site's README file, searching for 'gtf' brings up just 4 hits, in a section titled 'org_transcript.gff.gz and zoo_transcript.gff.gz files', saying that 'These files provide cDNA-to-Genomic, or spliced sequence alignments. These files include same-species and cross-species alignments, respectively. Alignments are generated via the Splign alignment tool'. It claims that those files are located in the 'MAPVIEW' directory, but there is no such directory at the top level of the FTP site. If I navigate to 'ARCHIVE', 'ANNOTATION_RELEASE.108', there is such a dir, but the files within are all empty (they have a 'size' of 0B). It's the same for the other ANNOTATION_RELEASEs.
The closest thing I can find appears to be in a subdir called 'GFF', which contains just one file: 'ref_GRCh38.p12_top_level.gff3'. Its contents look vaguely like a list mapping individual genes to genomic coordinates:
NC_000001.11 Curated Genomic pseudogene 131068 134836 . + . ID=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;Name=CICP27;description=capicua transcriptional repressor pseudogene 27;gbkey=Gene;gene=CICP27;gene_biotype=pseudogene;pseudo=true
NC_000001.11 Curated Genomic exon 131068 132927 . + . ID=id100;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11 Curated Genomic exon 132987 133322 . + . ID=id101;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11 Curated Genomic exon 133733 134058 . + . ID=id102;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11 Curated Genomic exon 134378 134836 . + . ID=id103;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11 BestRefSeq gene 134773 140566 . - . ID=gene12;Dbxref=GeneID:729737;Name=LOC729737;description=uncharacterized LOC729737;gbkey=Gene;gene=LOC729737;gene_biotype=lncRNA
NC_000001.11 BestRefSeq lnc_RNA 134773 140566 . - . ID=rna24;Parent=gene12;Dbxref=GeneID:729737,Genbank:NR_039983.2;Name=NR_039983.2;gbkey=ncRNA;gene=LOC729737;product=uncharacterized LOC729737;transcript_id=NR_039983.2
NC_000001.11 BestRefSeq exon 140075 140566 . - . ID=id104;Parent=rna24;Dbxref=GeneID:729737,Genbank:NR_039983.2;gbkey=ncRNA;gene=LOC729737;product=uncharacterized LOC729737;transcript_id=NR_039983.2
But it's not easy to tell whether they'll fit the bill. Since, however, as I said I can see no other candidate, I'll try passing this file to the program.
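To sanity-check whether a file like this can serve as a transcript-to-gene mapping, the ninth (attributes) column can be parsed directly. A sketch (the helper names are mine; it assumes tab-delimited GFF3 with transcript_id and gene attributes as in the excerpt above, and note that Sailfish's form nominally wants GTF, so this is just exploratory):

```python
def parse_gff3_attributes(attr_field):
    """Split the 9th GFF3 column ('key=value;key=value;...') into a dict."""
    out = {}
    for pair in attr_field.rstrip(';').split(';'):
        if '=' in pair:
            k, v = pair.split('=', 1)
            out[k] = v
    return out

def transcript_to_gene(gff3_lines):
    """Build a {transcript_id: gene_symbol} map from feature lines that
    carry both transcript_id and gene attributes."""
    mapping = {}
    for line in gff3_lines:
        if line.startswith('#'):
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 9:
            continue
        attrs = parse_gff3_attributes(cols[8])
        if 'transcript_id' in attrs and 'gene' in attrs:
            mapping.setdefault(attrs['transcript_id'], attrs['gene'])
    return mapping
```

Applied to the lnc_RNA line above, this yields NR_039983.2 mapping to LOC729737, which is the shape of mapping the Sailfish parameter seems to want.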
gff = os.path.join(paths['data_dir'], 'ref_GRCh38.p12_top_level.gff3.gz')
gi.tools.upload_file(gff, gi_hist)
After this, there were a number of arguments that I left at their defaults. Then there's a field called 'Calculate Effective Lengths', whose default value is 200. I couldn't figure out what this meant until I read the next field, called 'Standard deviation', whose description made me think it's asking for read lengths. But the Sailfish docs don't mention this parameter, nor the follow-on options. I think I'll leave all of these at their default values and just run for now.
That scheduled; we'll see what the output is. I also want to start a paired-end run with Sailfish, and specified that 'Mate pair 1' is dataset '#51 Clip on data 48', that 'Mate pair 2' is dataset '#52 Clip on data 49', and furthermore that 'Relative orientation of reads within a pair' is 'Mates are oriented toward each other (I = inward)'. I passed the same ref_GRCh38.p12_top_level.gff3.gz file as the 'File containing a mapping of transcripts to genes', left other params as their defaults, and scheduled the run.
While I was setting up the PE run, however, the jobs associated with the SE run turned red in the history, indicating that they had failed. The same issue eventually occurred with the PE run. When I clicked on the jobs, the error message was:
This job was terminated because it used more memory than it was allocated.
Please click the bug icon to report this problem if you need help.
When I check the info associated with the run, however, it shows all of the options I had set, and none of them refer to the amount of memory to allocate.
I googled the error and found this post, which makes it pretty clear that the dataset is too large for the memory allocated to all free users of the Galaxy Main server. Essentially, the recommendation from the Galaxy admins is to move to the Galaxy Cloudman server, with AWS support. I'm not sure I'm ready to do this right now; maybe the better option is to split/downsize the read files and try again? Unfortunately, I don't see any tools available to do this on the Galaxy Main page.
x = 'x'
blah = 'blah_x'
# Careful: `if 'z' or 'y' in blah:` always runs, because it parses as
# `('z') or ('y' in blah)` and the non-empty string 'z' is truthy.
# The intended membership test must name the container twice:
if 'z' in blah or 'y' in blah:
    print(blah)
gi_hist = gi.histories.get_current_history()['id']
gi.histories.get_status(gi_hist)
# gi.jobs.get_state()
# gi.workflows.export_workflow_dict()
errors = gi.histories.get_current_history()['state_ids']['error']
jobs = gi.jobs.get_jobs()
running_jobs = []
errored_jobs = []
for d in jobs:
    if d['state'] == 'running':
        running_jobs.append(d['id'])
    elif d['state'] == 'error':
        errored_jobs.append(d['id'])
# gi.jobs.show_job(errored_jobs[0])
gi.jobs.show_job(running_jobs[0])
dir(gi.genomes)
for gen in gi.genomes.get_genomes():
    if 'GRCh' in gen[0]:
        print(gen)
# gi.genomes.show_genome('hg38Patch11')
# gi.genomes.install_genome(source='URL', url_dbkey='ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/Gnomon_mRNA.fsa.gz')
# gi.libraries.create_library('DCM_BMap')
# gi_hist = gi.histories.get_current_history()['id']
# gi.histories.upload_dataset_from_library(gi_hist, 'GRCh38.p11 Jun. 2017 (hg38Patch11)')
gi.genomes.show_genome('GRCh38.p11 Jun. 2017 (hg38Patch11)')
gi_hist
dir(gi.libraries)
for gen in gi.histories.show_history(gi_hist, contents=True):
    print('{}: {}'.format(gen['id'], gen['name']))
# gi.histories.show_history(gi_hist)
gnomon = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/Gnomon_mRNA.fsa.gz'
# gi.tools.put_url(gnomon, gi_hist)
gi.tools.url()
kwargs = {'blah': '1', 'blah2': 2}
def test(**kwargs):
    # quick check of dict unpacking: test(**kwargs) repacks the
    # key/value pairs into a fresh dict inside the function
    for k, v in kwargs.items():
        print(k, v)
test(**kwargs)
local_transcriptome = os.path.join(paths['data_dir'], 'Gnomon_mRNA.fsa.gz')
gi.tools.upload_file(local_transcriptome, gi_hist)
gi.datasets.show_dataset('bbd44e69cb8906b545b68d86e45e2c36')
%%time
one_layer_deep = {}
for k, v in folder_ids.items():
    types = []
    for id_ in v:
        output = gi.folders.show_folder(id_, contents=True)
        for d in output['folder_contents']:
            types.append(d['type'])
    one_layer_deep[k] = set(types)
# folder_ids.items()
one_layer_deep
gi.folders.show_folder(*folder_ids['Illumina BodyMap 2.0'], contents=True)
gi.datasets.download_dataset('f6a3bd151a3a8748', )
dir(gi.ftpfiles)
# gi.ftpfiles.get_ftp_files()
print(gi.libraries.show_library(libs[2][1]))
print(libs[2])
# gi.datasets.show_dataset('f6a3bd151a3a8748', hda_ldda='ldda')
dir(gi.datasets)
folder_ids['Illumina BodyMap 2.0']
gi.folders.show_folder('Ff1e7bb3cd7a0f387', contents=True)
# gi.folders.get_folders('Ff1e7bb3cd7a0f387')
# dir(gi.folders)
gi.libraries.get_folders('00b73c7735c2f000')
# gi.libraries.get_folders('00b73c7735c2f000', folder_id='Fcad2bae1b5da4dc0')
stats.describe(np.array(lens))
lens_S = pd.Series(lens)
print('{}\n\n{}'.format(lens_S.describe(), lens_S.value_counts()))
print(gi.folders.show_folder('Fd795f6d3e169879a', contents=False))
print(gi.genomes.show_genome('hg38Patch11').keys())
# print(gi.genomes.show_genome('hg38Patch11')['chrom_info'])
# gi.folders.show_folder()
print(gi.libraries.get_folders('8bb3ab7690e13de8'))
print(gi.folders.show_folder('Fd795f6d3e169879a'))
So the question is, how to access the individual files within the folder above, and create a workflow where they get processed...
for d in dir(gi):
    if d[0] != '_':
        print(d)
dir(gi.datasets)
gi.datasets.show_dataset()
gi.workflows.get_workflows()
dir(gi.workflows)
for d in gi.libraries.get_libraries():
    print('{}: {} {:<30} {}'.format(d['id'], d['public'], d['name'], d['description']))
bmap_lib = gi.libraries.get_libraries('8bb3ab7690e13de8')
bmap_lib
dir(gi)
dir(gi.libraries)
gi.libraries.get_libraries()
# gi.libraries.show_dataset('8bb3ab7690e13de8', 'a24fb72e059884e2')
folders = gi.libraries.get_folders('8bb3ab7690e13de8')
for fold in folders:
    print(type(fold))
folders
folder = gi.libraries.show_folder(library_id='8bb3ab7690e13de8', folder_id='Fd795f6d3e169879a')
folder
# dir(gi.folders.show_folder('Fd795f6d3e169879a', contents=True))
# gi.folders.show_folder('Fd795f6d3e169879a', contents=True)
# gi.datasets.show_dataset('bd99481f96cf6512')
# dir(folder)
folders_dict = gi.folders.show_folder('Fd795f6d3e169879a', contents=True)
for d in folders_dict['folder_contents']:
    if d['name'][-13:] == 'thyroid.fastq':
        print(d)
gi.histories.upload_dataset_from_library(gi.histories.get_current_history()['id'], 'f6a3bd151a3a8748')
dir(gi.histories)
gi.histories.get_current_history()
hist_id = gi.histories.get_current_history()['id']
dir(gi.histories)
# gi.histories.show_dataset_collection(hist_id)
gi.histories.upload_dataset_from_library(hist_id, '1f711917a9c0c715')
# folders_dict
gi.datasets.show_dataset('a041b331babfb419')
hs_chr = 'GRCh38'
dir(gi.genomes)
for gen in gi.genomes.get_genomes():
    if 'hg38' in gen[1]:
        print(gen)
    # if hs_chr in gen[0]:
    #     print(gen)
    # print(gen)
    # if 'sapiens' in gen[0]:
    #     print(gen)
    # if gen[0][:4] == 'Homo':
    #     print(gen)
gi.genomes.get_genomes()
dir(gi.genomes)
gi.genomes.show_genome('hg38Patch11')
gi.genomes.install_genome(ncbi_name='hg38Patch11')
dir(gi.folders)
# gi.folders.create_folder(parent_folder_id=None, name='bmap')
dir(gi.histories)
dir(gi.genomes)
gi.workflows.get_workflows()
dir(gi)
dir(gi.datasets)
# dir(gi.datasets.show_dataset)
# gi.datasets.show_dataset('a041b331babfb419')
gi.libraries.show_library('a041b331babfb419')
gi.folders.show_folder('Fb778b6da6fbd6bdf', contents=True)
from bioblend.galaxy.objects import GalaxyInstance
gi = GalaxyInstance('https://usegalaxy.org/', galaxy_api)
# wf = gi.workflows.list()[0]
wf = gi.workflows.list()
hist = gi.histories.list()[0]
inputs = hist.get_datasets()[:2]
input_map = dict(zip(wf.input_labels, inputs))
dir(gi.workflows)
gi.workflows.list()
description1 = 'This will be the main folder, within the root dir, to hold data \
for DCM\'s analysis of the Illumina BodyMap Data on a Galaxy Main instance.'
gi.folders.create_folder(parent_folder_id=None, name='bodymap', description=description1)
description2 = 'This will be the folder to hold the H. sapiens GRCh38 reference genome.'
gi.folders.create_folder(parent_folder_id='bodymap', name='hs_chr', description=description2)
description3 = 'This folder will hold the raw Illumina BodyMap reads from the NCBI SRA.'
gi.folders.create_folder(parent_folder_id='bodymap', name='bmap_raw', description=description3)
dir(gi)
complex_ = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53\NC_000020_11_ERR030871.sam'
print(os.path.basename(complex_))
Incidentally, since this is the first real use of the Ubuntu bash on Windows for me, I found that it didn't have tab autocompletion activated, so I followed these instructions, modifying C:\Users\DMacKellar\AppData\Local\lxss\root.bashrc with Notepad++ to uncomment the last 3 lines. Now it works, but it beeps annoyingly whenever the autocompletion doesn't execute perfectly. I'll follow these instructions to attempt to remedy that. I used vim from within the Ubuntu terminal to uncomment line 21 in /etc/inputrc. ...Ok, that got it.
A couple more points:
When it comes to copying terminal output to the Windows clipboard (for the all-important copying error tracebacks and pasting them into Google), I had to use this, clicking the icon in the upper-left of the Bash terminal window, selecting Properties -> Options -> Quick Edit Mode. Hopefully this doesn't reset upon re-opening Bash (it doesn't appear to right now).
When it comes to pasting from the Windows clipboard TO the terminal, Ctrl+V doesn't work, but right-clicking the trackpad works.
I still was getting an error when attempting to launch FastQC; something about Java needing X windows support:
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:204)
at java.awt.Window.<init>(Window.java:536)
at java.awt.Frame.<init>(Frame.java:420)
at java.awt.Frame.<init>(Frame.java:385)
at javax.swing.JFrame.<init>(JFrame.java:189)
at uk.ac.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:63)
at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:332)
This site offered help for that, saying to use Xming. That program sounded familiar from my Harvard days, so I checked and found that it's still installed on the PC. After activating it, attempting to launch FastQC still yielded the same error, but I then entered into the terminal:
export DISPLAY=:0
and tried again and now it launches!
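The same fix can be applied from Python when launching X programs via subprocess; a sketch (assumes `fastqc` is on the PATH inside the Ubuntu shell, which may not be the case):

```python
import os
import shutil
import subprocess

# Copy the current environment and point X clients at the local X server
# (Xming listens as display :0).
env = dict(os.environ, DISPLAY=':0')

# Only attempt the launch if fastqc is actually on PATH.
if shutil.which('fastqc'):
    subprocess.Popen(['fastqc'], env=env)

print(env['DISPLAY'])  # :0
```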
But... more complications. FastQC doesn't have access to the Windows directory structure, where I left the fastq files. I need to figure out where in the Windows dir system the linux directory structure is being mounted.
Ah, ok: the structure within the Ubuntu shell mirrors that of Windows, from the (within-Ubuntu) location of '/mnt/c/'; i.e., within that dir, Ubuntu can access anything on the C:\ drive within Windows. As this post says, the actual location of that within the Windows system is at 'C:\Users\DMacKellar\AppData\Local\lxss\rootfs', but it's not recommended to make drastic changes from within the Windows end. In other words, the best way to access the fastq files is to point the Ubuntu terminal at '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq'.
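That mapping is regular enough to automate; a small hypothetical helper (not part of the notebook, and assuming the standard WSL drive mounts under /mnt) that translates a Windows path to its WSL equivalent:

```python
def win_to_wsl(win_path):
    """Translate a Windows path like 'C:\\Users\\...' into its WSL mount
    point '/mnt/c/Users/...' (assumes drives are mounted under /mnt)."""
    drive, rest = win_path.split(':', 1)
    return '/mnt/' + drive.lower() + rest.replace('\\', '/')

print(win_to_wsl(r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'))
# /mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq
```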
When I try to open Galaxy2-[adrenal_1.fastq].fastqsanger, however, the X window for FastQC says 'Read 48000 sequences (95%)' and doesn't seem to change over time. Task Manager indicates that it isn't using many resources, so it may be stuck, it may need more time, or it may be that it won't work loading such a file from within a window. The Ubuntu terminal is outputting warnings and errors as a result of the open X window (whether from just loading the program or from processes launched upon attempting to open the fastq file, I'm not certain), saying that it can't find various files in '/etc/fastqc/Configuration/'.
I wonder if the command line would be better. It sounds like to run in command-line mode, you just call the program followed immediately by the fastq file you want to analyze. I killed the process so I could try again with this approach (quoting the filename, since the square brackets are shell glob characters):
fastqc '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger'
Ok, so that at least told me:
Failed to process file Galaxy2-[adrenal_1.fastq].fastqsanger
java.lang.IllegalArgumentException: No key called gc_sequence:ignore in the config data
Googling that, the first hit says that another user fixed this error by copying the contents of some other Fastqc config dir over to /etc/fastqc dir, but I can't find the configuration location to which they're referring. Doing a 'find' command for fastqc returns:
./usr/bin/fastqc
./usr/share/applications/fastqc.desktop
./usr/share/doc/fastqc
./usr/share/fastqc
./usr/share/icons/hicolor/32x32/apps/fastqc_icon.png
./usr/share/java/fastqc-0.11.4.jar
./usr/share/java/fastqc.jar
./usr/share/man/man1/fastqc.1.gz
./var/cache/apt/archives/fastqc_0.11.4+dfsg-3_all.deb
./var/lib/dpkg/info/fastqc.list
./var/lib/dpkg/info/fastqc.md5sums
The only likely dirs here are /usr/bin/fastqc/, /usr/share/doc/fastqc/, and /usr/share/fastqc/, and none of them contains a Config file. Digging a bit deeper, it sounds like other users are also surprised by the installation they get from 'sudo apt-get install fastqc'; apparently most such installs aren't expected to put anything in '/etc/'. The docs say it's a Java program, so it should have a minimal footprint and run out of a simple unzipped directory. I'll try:
(within Ubuntu terminal):
cd ~
sudo apt-get remove fastqc
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
chmod +x ~/FastQC/fastqc
cd /mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/
~/FastQC/fastqc 'Galaxy2-[adrenal_1.fastq].fastqsanger'
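With the zip install working, the whole directory of reads could be processed in one go from the notebook; a sketch using subprocess (assumes the FastQC zip was unpacked to ~/FastQC and that the fastq files still sit in the Galaxy_rnaseq dir):

```python
import glob
import os
import shutil
import subprocess

fastq_dir = '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq'
fastq_files = glob.glob(os.path.join(fastq_dir, '*.fastqsanger'))

# Prefer fastqc on PATH, falling back to the unzipped copy in ~/FastQC.
fastqc = shutil.which('fastqc') or os.path.expanduser('~/FastQC/fastqc')

# Passing filenames as list elements (no shell) sidesteps the glob issue
# with the square brackets in the Galaxy filenames.
for f in fastq_files:
    subprocess.run([fastqc, f], check=True)

print(len(fastq_files), 'files processed')
```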
Ok, now it works lickety-split. Even launching the GUI in an X window, navigating to the dir with the fastq files, and opening one analyzes the data very rapidly. The sequences look pretty poor, and definitely in need of trimming: